An Improved Keyword Extraction Method Including Compound Nouns

이한동; 김종배


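Korean abstract

As text mining research has become more active, keyword extraction techniques have come to play an important role in text mining tasks that process large numbers of documents, such as document retrieval, clustering, document classification, and indexing. In existing keyword extraction research, however, extracting an important compound noun as a single keyword requires a preprocessing step tailored to the characteristics of the documents. In addition, since the arrival of the Web 2.0 era, web documents are often created directly by ordinary users; because of the sheer volume of such documents and the impossibility of pinning them to a specific field, building a word dictionary for each document is difficult. For this reason, keyword extraction research to date has extracted keywords without the preprocessing needed to capture compound nouns as keywords. This paper studies how to find compound nouns that can be treated as a single keyword within a document, without per-document preprocessing. To this end, PMI (Point-wise Mutual Information) values are used to find associations between words in the document. The results of this study are expected to serve as a basis for improved results in related research such as document summarization and topic detection.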

English abstract

As text mining techniques evolve, keyword extraction has become critical to text mining research that handles large numbers of documents, such as document retrieval, clustering, classification, and indexing. However, in existing keyword extraction studies, a preprocessing step is required to extract an important compound noun as a single keyword. Moreover, since the arrival of the Web 2.0 era, many web documents are created by ordinary users themselves. The resulting volume of documents, and the fact that their field cannot be specified, make it difficult to construct a word dictionary for each document. As a result, existing studies have extracted keywords without a preprocessing step for compound nouns. This paper studies how to find compound nouns that can be treated as a single keyword within a document, without preprocessing each document. For this purpose, PMI (Point-wise Mutual Information) values are used to find associations between the words in the document. The results of this study are expected to lead to better outcomes in research such as document retrieval, document summarization, and topic detection.
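To make the PMI step concrete, the following is a minimal sketch, not the authors' implementation. It assumes the document has already been reduced to a sequence of nouns (for Korean this would normally come from a morphological analyzer, which is not shown), scores only adjacent pairs, and estimates probabilities from counts within the single document. The function names `adjacent_pmi` and `compound_candidates` and the thresholds `min_count` and `min_pmi` are hypothetical choices for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def adjacent_pmi(tokens):
    """Return {(x, y): (pmi, count)} for every adjacent token pair.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with all probabilities
    estimated from this one document (an assumption of this sketch).
    """
    if len(tokens) < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_pairs
        p_x = unigrams[x] / n_tokens
        p_y = unigrams[y] / n_tokens
        scores[(x, y)] = (math.log2(p_xy / (p_x * p_y)), count)
    return scores


def compound_candidates(tokens, min_count=2, min_pmi=1.0):
    """Keep pairs that recur and are strongly associated (hypothetical thresholds)."""
    kept = [
        (pair, pmi)
        for pair, (pmi, count) in adjacent_pmi(tokens).items()
        if count >= min_count and pmi >= min_pmi
    ]
    # Highest PMI first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy noun sequence standing in for one tokenized document.
nouns = [
    "text", "mining", "keyword", "extraction", "text",
    "mining", "document", "keyword", "extraction", "method",
]
candidates = compound_candidates(nouns)
for (x, y), pmi in candidates:
    print(f"{x} {y}\t{pmi:.2f}")
```

On this toy input only "text mining" and "keyword extraction" pass both thresholds. The frequency floor is there because PMI tends to overrate pairs seen only once; a pair kept this way could then be merged into a single token before a ranking method such as TF-IDF or TextRank is applied.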

Keywords

Text mining; keyword extraction; PMI; TF-IDF; TextRank
