Paper: A Stochastic Finite-State Word-Segmentation Algorithm For Chinese

ACL ID P94-1010
Title A Stochastic Finite-State Word-Segmentation Algorithm For Chinese
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 1994
Authors

We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.

THE PROBLEM

The initial step of any text analysis task is the tokenization of the input into words. For many writing systems, using whitespace as a delimiter for words yields reasonable results. However, for Chinese and other systems where whitespace is not used to delimit words, such trivial schemes will not work. Chinese writing is morphosyllabic (DeFrancis, 1984), meaning that each hanzi- 'Chinese charac...
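
As a small illustration of why whitespace tokenization breaks down for Chinese, the following Python sketch applies the "trivial scheme" to one English and one Chinese sentence. The function name `whitespace_tokenize` and both example sentences are assumptions made for illustration; they do not come from the paper, and the sketch is not the paper's segmentation method.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (the 'trivial scheme')."""
    return text.split()


# Illustrative sentences (assumptions, not taken from the paper).
english = "I like Chinese food"
chinese = "我喜欢中国菜"  # the same sentence, written without spaces

english_tokens = whitespace_tokenize(english)
chinese_tokens = whitespace_tokenize(chinese)

print(english_tokens)  # ['I', 'like', 'Chinese', 'food']
print(chinese_tokens)  # ['我喜欢中国菜']
```

The English sentence falls apart into four words, while the Chinese sentence comes back as a single unsegmented token, so word boundaries have to be recovered by some other means.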