Paper: Compound Noun Segmentation Based On Lexical Data Extracted From Corpus

ACL ID A00-1027
Title Compound Noun Segmentation Based On Lexical Data Extracted From Corpus
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Main Conference
Year 2000
Authors
  • Juntae Yoon (University of Pennsylvania, Philadelphia PA)

Compound noun analysis is one of the crucial problems in Korean language processing because a series of nouns in Korean may appear without white space in real texts, which makes it difficult to identify the morphological constituents. This paper presents an effective method of Korean compound noun segmentation based on lexical data extracted from a corpus. The segmentation is done in two steps. First, it uses a manually constructed built-in dictionary for segmentation, whose data were extracted from a 30-million-word corpus. Second, a segmentation algorithm using statistical data is proposed, where simple nouns and their frequencies are also extracted from the corpus. The analysis is executed based on CYK tabular parsing and a min-max operation. In experiments, its accuracy is about 97...
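
The abstract names CYK tabular parsing and a min-max operation but does not spell out the scoring, so the following is only a minimal sketch under stated assumptions, not the paper's algorithm. It assumes that a segmentation is scored by the lowest corpus frequency among its pieces and that the segmentation with the highest such score is preferred, with ties going to fewer pieces. The function name `segment`, the tie-breaking rule, and the romanized toy frequencies are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Cell = Optional[Tuple[int, List[str]]]  # (score, pieces) or None if no analysis


def segment(word: str, freq: Dict[str, int]) -> Optional[List[str]]:
    """Segment `word` into known simple nouns with a CYK-style table.

    Assumption (not from the paper): a segmentation's score is the minimum
    frequency of its pieces, and the highest-scoring segmentation wins.
    """
    n = len(word)
    # best[i][j] holds the best analysis found for the substring word[i:j].
    best: List[List[Cell]] = [[None] * (n + 1) for _ in range(n + 1)]

    # Fill the table bottom-up, from short spans to long ones.
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            span = word[i:j]
            cand: Cell = None

            # Case 1: the whole span is itself a simple noun in the lexicon.
            if span in freq:
                cand = (freq[span], [span])

            # Case 2: combine the best analyses of two adjacent sub-spans.
            for k in range(i + 1, j):
                left, right = best[i][k], best[k][j]
                if left is None or right is None:
                    continue
                score = min(left[0], right[0])   # "min" over the pieces
                pieces = left[1] + right[1]
                better = (
                    cand is None
                    or score > cand[0]            # "max" over segmentations
                    or (score == cand[0] and len(pieces) < len(cand[1]))
                )
                if better:
                    cand = (score, pieces)

            best[i][j] = cand

    result = best[0][n]
    return result[1] if result is not None else None


# Toy lexicon with made-up frequencies (romanized stand-ins for Korean nouns).
toy_freq = {"data": 50, "base": 40, "database": 5, "system": 30}
print(segment("databasesystem", toy_freq))
```

With this toy lexicon the rare entry "database" loses to the split "data" + "base", because the split's weakest piece is still more frequent than the unsplit form. A string that cannot be covered by lexicon entries returns `None`, which a real system would presumably hand to some fallback step.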