Korean Lexical Disambiguation Based on Statistical Information

ㆍ Authors: 박하규, 김영택
ㆍ Journal: 한국통신학회논문지
ㆍ Volume/Issue: 1994 | Vol. 19, No. 2 | pp. 265-275 (11 pages)
ㆍ Publisher: 한국통신학회
ㆍ Subject area: Other

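The full text of this paper is provided free of charge through a paper-linking arrangement with 한국과학기술정보연구원 (Korea Institute of Science and Technology Information).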

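Abstract

Lexical disambiguation is one of the most fundamental areas of natural language processing, underlying tasks such as speech recognition/synthesis, information retrieval, and corpus tagging. This paper describes a Korean lexical disambiguation technique that uses statistical information extracted from corpora. For finer-grained disambiguation, the technique uses token tags, which correspond to the results of morphological analysis, instead of part-of-speech tags. The lexical selection function proposed in this paper achieves considerably high accuracy because it closely reflects lexical characteristics of Korean, such as agreement relations between endings and postpositions. The technique also supports two disambiguation modes, unique selection and multiple selection, so that it can be used in whichever way suits the application area.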

Abstract (other language)

Lexical disambiguation is one of the most basic areas in natural language processing, which includes speech recognition/synthesis, information retrieval, corpus tagging, etc. This paper describes a Korean lexical disambiguation mechanism in which the disambiguation is performed on the basis of statistical information collected from corpora. In this mechanism, token tags corresponding to the results of morphological analysis are used instead of part-of-speech tags for the purpose of more detailed disambiguation. The proposed lexical selection function shows considerably high accuracy, since the lexical characteristics of Korean, such as the concordance of endings and postpositions, are well reflected in it. Two disambiguation methods, a unique selection method and a multiple selection method, are provided so that the mechanism can be applied appropriately according to the application area.
