Development of an Unsupervised Learning-Based Document Classification System Using Information Retrieval Techniques


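ㆍ Authors: Noh, Dae-Wook (노대욱); Lee, Soo-Yong (이수용); Ra, Dong-Yul (나동열)
ㆍ Journal: Journal of KIISE: Software and Applications (정보과학회논문지: 소프트웨어 및 응용)
ㆍ Volume/Issue: 2007, Vol. 34, No. 2, pp. 160-168 (9 pages)
ㆍ Publisher: Korean Institute of Information Scientists and Engineers (한국정보과학회)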
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other

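The full text of this paper is provided free of charge through an article-linking arrangement with the Korea Institute of Science and Technology Information (KISTI).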


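Abstract (translated from Korean)

When a document classifier is developed with supervised learning, a large manually category-labeled corpus is required, and building such a corpus takes considerable time and effort. Recently, a methodology has been proposed that trains a document classifier from a raw corpus and a small amount of seed information per category instead of a category-labeled corpus. Within this methodology, this paper introduces an unsupervised learning technique that performs better than the results reported in other studies. The distinguishing feature of the proposed technique is that it starts from seed words, uses average mutual information to learn further representative words and their weights, and then updates those weights with techniques widely used in information retrieval. The paper shows that repeating this process ultimately yields a high-performance system.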

Abstract (English)

Developing a text classifier with supervised learning requires a large, manually labeled corpus, and building such a corpus takes a lot of time and human effort. Recently, a research paradigm was proposed that uses a raw corpus and a small amount of seed information instead of a manually labeled corpus. In this paper we introduce an unsupervised learning method that achieves better performance than other related work. The characteristic of our approach is that average mutual information is used to learn representative words and their weights, and the weights are then updated using a technique inspired by work in information retrieval. It is shown that iterating this learning process makes it possible to develop a high-performance system.
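
The loop described in the abstract (seed words, then average mutual information to pick representative words and weights, then a weight update, repeated) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the information-retrieval-inspired update, so the `blend` function below is a purely hypothetical placeholder (a simple interpolation of old and new weights), and the names `classify`, `average_mutual_information`, `bootstrap`, the parameters `rounds`, `top_k`, `alpha`, and the toy corpus are all assumptions made for illustration.

```python
import math

def classify(doc, weights):
    """Return the category whose weighted words best match doc, or None if no word matches."""
    scores = {c: sum(w for t, w in ws.items() if t in doc) for c, ws in weights.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def average_mutual_information(docs, labels, category):
    """Average mutual information between word presence and membership in `category`.

    Only words positively associated with the category are scored.
    """
    n = len(docs)
    in_cat = [lab == category for lab in labels]
    n_cat = sum(in_cat)
    scores = {}
    for word in sorted(set().union(*docs)):
        n_w = sum(1 for d in docs if word in d)
        n_wc = sum(1 for d, c in zip(docs, in_cat) if c and word in d)
        if n_wc * n <= n_w * n_cat:  # not positively associated: skip
            continue
        mi = 0.0
        # Sum over the four (word present/absent, in/out of category) cells.
        for joint, m_w, m_c in [
            (n_wc, n_w, n_cat),
            (n_w - n_wc, n_w, n - n_cat),
            (n_cat - n_wc, n - n_w, n_cat),
            (n - n_w - n_cat + n_wc, n - n_w, n - n_cat),
        ]:
            if joint > 0:
                mi += (joint / n) * math.log2(joint * n / (m_w * m_c))
        scores[word] = mi
    return scores

def blend(old, new, alpha=0.5):
    """Hypothetical placeholder for the paper's IR-inspired update: interpolate weights."""
    return {t: alpha * old.get(t, 0.0) + (1 - alpha) * new.get(t, 0.0)
            for t in set(old) | set(new)}

def bootstrap(docs, seeds, rounds=3, top_k=3):
    """Iteratively label documents and relearn representative words from seed words."""
    weights = {c: {w: 1.0 for w in ws} for c, ws in seeds.items()}
    for _ in range(rounds):
        labels = [classify(d, weights) for d in docs]  # unlabeled docs stay None
        for c in seeds:
            mi = average_mutual_information(docs, labels, c)
            top = sorted(mi, key=mi.get, reverse=True)[:top_k]
            weights[c] = blend(weights[c], {w: mi[w] for w in top})
    return weights

# Toy corpus (made up): each document is a set of words.
docs = [
    {"soccer", "goal", "team"}, {"goal", "team", "coach"}, {"team", "coach", "match"},
    {"stock", "market", "bank"}, {"market", "bank", "loan"}, {"bank", "loan", "rate"},
]
seeds = {"sports": ["soccer"], "finance": ["stock"]}
weights = bootstrap(docs, seeds)
predicted = [classify(d, weights) for d in docs]
```

In this toy run only one document per category contains a seed word at the start; the words learned in the first round are what let later rounds label the remaining documents. A real system would need a much larger raw corpus, proper tokenization, and the actual weight-update technique from the paper.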

Keywords

Text classification, unsupervised learning, representative words, mutual information, information retrieval
