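A Study of an Appropriate Statistical Model for Using Corpus Frequency Information: Focusing on the Functional Relationship between Types and Tokens as Corpus Size Varies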

The Statistical Relationship between Linguistic Items and Corpus Size

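ㆍ Authors: 양경숙, 박병선
ㆍ Journal: Language and Information (언어와 정보)
ㆍ Volume/Issue: 2003 | Vol. 7, No. 2 | pp. 103-115 (13 pages)
ㆍ Publisher: Korean Society for Language and Information (한국언어정보학회)
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other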

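The full text of this paper is provided free of charge through an article-linking arrangement with the Korea Institute of Science and Technology Information (KISTI).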

Abstract

In recent years, many organizations have been building large corpora of their own in pursuit of corpus representativeness. However, there is no reliable guideline for how large a corpus should be, especially for Korean. In this study, we propose a new statistical model, an ARIMA (Autoregressive Integrated Moving Average) model, for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. We then show that the proposed ARIMA model is valid, accurate, and reliable. We expect this study to contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness, and linguistic comprehensiveness.

Keywords

relationship between the number of types and the number of tokens; 100 million word corpus; corpus size; ARIMA model; simulation; frequency
