Common-Word Collection and a Document Reclassification Algorithm for Improving the Accuracy of Document Clustering

ㆍ Authors: 신준철 (Shin, Joon-Choul), 옥철영 (Ock, Cheol-Young), 이응봉 (Lee, Eung-Bong)
ㆍ Journal: 정보처리학회논문지 B (The KIPS Transactions: Part B)
ㆍ Volume/Issue: 2012, No. 1, pp. 53-62 (10 pages)
ㆍ Publisher: 한국정보처리학회 (Korea Information Processing Society)
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other

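The full text of this paper is provided free of charge through a paper-linking arrangement with the Korea Institute of Science and Technology Information (한국과학기술정보연구원).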

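Abstract (translated from Korean)

Information retrieval uses clustering to handle large numbers of search result documents efficiently, but clustering accuracy generally meets requirements in only some domains. This paper proposes two methods for improving the clustering accuracy of search result documents. The first defines common-words (범용어), words that appear frequently during clustering but carry low weight, and proposes a method that compares search results to collect common-words automatically and compute their weights. In experiments, using common-words instead of stop-words (불용어) corrected 34% of the clustering errors. The second is an algorithm that generates initial clusters with a group-average-link clustering algorithm, then measures the similarity between each document and each cluster and reclassifies the document into the cluster with the highest similarity. Comparison experiments on clustering results using Naver 지식인 (JiSikIn) categories confirmed that the reclassified clusters were 1.81% more accurate than the initial clusters.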
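
The common-word idea can be illustrated with a minimal Python sketch. The abstract does not give the paper's weighting formula, so everything here is an assumption for illustration: the function name `common_word_weights`, the input layout (one list of tokenized documents per query), and the IDF-style weight computed over query result sets rather than over documents. It is not the authors' method, only one plausible reading of "compare search results and assign low weights to words that occur everywhere".

```python
import math
from collections import Counter


def common_word_weights(result_sets):
    """Assign each word a weight in [0, 1]; lower means more 'common'.

    result_sets: one entry per query, each a list of documents,
                 each document a list of tokens (hypothetical layout).
    The weight is an IDF-like score over result sets (an assumption,
    not the formula from the paper): a word that occurs in the results
    of every query gets weight 0, a word seen for one query gets 1.
    """
    n_sets = len(result_sets)
    set_freq = Counter()
    for docs in result_sets:
        vocab = set()
        for doc in docs:
            vocab.update(doc)
        # Count each word at most once per result set.
        set_freq.update(vocab)

    if n_sets < 2:
        # With fewer than two result sets there is nothing to compare.
        return {word: 1.0 for word in set_freq}

    log_n = math.log(n_sets)
    return {word: math.log(n_sets / freq) / log_n
            for word, freq in set_freq.items()}


# Toy data: "the" appears in the results of all three queries.
example_sets = [
    [["the", "cat"], ["the", "dog"]],
    [["the", "car"]],
    [["the", "stock"]],
]
example_weights = common_word_weights(example_sets)
```

Unlike a fixed stop-word list, which removes words outright, a scheme of this kind keeps every word but scales its contribution, which is the distinction the abstract draws between stop-words and common-words.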

Abstract (English)

Clustering is used in information retrieval systems to handle large numbers of retrieved documents efficiently, but its accuracy generally meets requirements in only some domains. This paper proposes two methods to improve the accuracy of clustering search results. First, we define a common-word as a word that is used frequently but carries low weight during clustering, and we propose a method that automatically collects common-words by comparing search results and calculates their weights. In experiments, using common-words instead of stop-words reduced clustering errors by 34%. Second, we propose an algorithm that generates initial clusters from the retrieved documents using average-link clustering, then re-evaluates the similarity between each document and the clusters and reclassifies the document into the most similar cluster. In experiments using Naver JiSikIn categories, the accuracy of the reclassified clusters was 1.81% higher than that of the initial clusters without reclassification.
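
The reclassification step can be sketched in Python as follows. The abstract does not say how document-to-cluster similarity is measured, so this sketch assumes cosine similarity between a document's term-weight vector and the cluster centroid, a single reassignment pass, and initial labels that already come from some average-link clustering. The names `cosine`, `centroids`, and `reclassify` are hypothetical, and the code is an illustration under those assumptions rather than the paper's algorithm.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(docs, labels):
    """Mean term-weight vector of each cluster."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc, label in zip(docs, labels):
        counts[label] += 1
        for term, weight in doc.items():
            sums[label][term] += weight
    return {label: {t: s / counts[label] for t, s in vec.items()}
            for label, vec in sums.items()}


def reclassify(docs, labels):
    """Move each document to the cluster whose centroid is most similar.

    Centroids are computed once from the initial clusters (an assumption);
    on a similarity tie the document stays in its current cluster.
    """
    cents = centroids(docs, labels)
    new_labels = []
    for doc, label in zip(docs, labels):
        best = max(cents,
                   key=lambda c: (cosine(doc, cents[c]), c == label))
        new_labels.append(best)
    return new_labels


# Toy data: the last document was placed in cluster 0 by the first pass.
example_docs = [
    {"apple": 1.0, "fruit": 1.0},
    {"apple": 1.0, "pie": 1.0},
    {"car": 1.0, "engine": 1.0},
    {"car": 1.0, "engine": 1.0},
]
initial_labels = [0, 0, 1, 0]
final_labels = reclassify(example_docs, initial_labels)
```

In the toy data the fourth document has similarity 0.5 to the centroid of cluster 0 and 1.0 to the centroid of cluster 1, so it is moved to cluster 1 while the other documents stay where they were.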

Keywords

Web Search Results, Clustering, Classification, Reclassification, Stop-Word, Common-Word, Cluster Frequency, Incremental Clustering
