- 코퍼스 빈도 정보 활용을 위한 적정 통계 모형 연구: 코퍼스 규모에 따른 타입/토큰의 함수관계 중심으로
- The Statistical Relationship between Linguistic Items and Corpus Size
- ㆍ 저자명
- 양경숙,박병선
- ㆍ 간행물명
- 언어와 정보
- ㆍ 권/호정보
- 2003년|7권 2호|pp.103-115 (13 pages)
- ㆍ 발행정보
- 한국언어정보학회
- ㆍ 파일정보
- 정기간행물| PDF텍스트
- ㆍ 주제분야
- 기타
In recent years, many organizations have been constructing their own large corpora to achieve corpus representativeness. However, there is no reliable guideline as to how large corpus resources should be compiled, especially for Korean corpora. In this study, we have contrived a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous researches on this issue. Finally, we shall illustrate that the ARIMA model presented is valid, accurate and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness and linguistic comprehensiveness.