Paper: The Automatic Extraction Of Open Compounds From Text Corpora

ACL ID C96-2208
Title The Automatic Extraction Of Open Compounds From Text Corpora
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1996
Authors

This paper describes a new method for extracting open compounds (uninterrupted sequences of words) from text corpora of languages such as Thai, Japanese and Korean that lack explicit word segmentation. Without applying word segmentation techniques to the input plain text, we generate n-gram data from it. We then count the occurrences of each string and sort the strings in alphabetical order. It is significant that the frequency of occurrence of a string decreases when the window size of observation is extended. From the statistical point of view, a word is a string with a fixed pattern that is used repeatedly, meaning that it should occur with a higher frequency than a string that is not a word. We observe the variation of frequency of the sorted n-gram data and extract th...
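
The counting and sorting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `ngram_counts` and `sorted_ngrams`, the toy string, and the maximum window size of 3 are assumptions chosen for illustration. The sketch only shows that the count of a string never increases as the observation window is extended; it does not reproduce the paper's extraction criterion, which is not given in the text above.

```python
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in unsegmented text.

    Hypothetical helper: no word segmentation is applied beforehand.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def sorted_ngrams(counts):
    """Return (string, frequency) pairs in alphabetical order of the string."""
    return sorted(counts.items())


# Toy stand-in for an unsegmented corpus (an assumption, not data from the paper).
text = "abcabcabd"
counts = ngram_counts(text, max_n=3)

# In alphabetical order, each string sits directly before its extensions,
# so the change in frequency as the window grows can be read off row by row.
for string, freq in sorted_ngrams(counts):
    print(f"{string}\t{freq}")
```

Alphabetical sorting places every string next to its longer extensions, which is what makes it possible to observe the variation in frequency by scanning the sorted list.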