Paper: Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data

ACL ID P98-2206
Title Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 1998
Authors

Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese text without using any lexicon or hand-crafted linguistic resource. The statistical data the algorithm requires, namely the mutual information and the difference of t-score between characters, are derived automatically from raw Chinese corpora. A preliminary experiment shows that the segmentation accuracy of the algorithm is acceptable. We hope that the insights gained from this approach will help improve the performance of existing segmenters, especially their ability to cope with unknown words and to adapt to various domains, though the algorithm can also be used as a stand-alone segmenter in some NLP applications.
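To make the idea of lexicon-free statistics concrete, the following is a minimal sketch of how mutual information between adjacent characters might be estimated from raw text and used to place word boundaries. It is an illustration under stated assumptions, not the algorithm of the paper: the function names, the maximum-likelihood estimates, and the single fixed `threshold` are all hypothetical, and the difference of t-score that the abstract mentions is left out entirely.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in raw text lines."""
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def mutual_information(x, y, unigrams, bigrams):
    """Pointwise mutual information log2(p(x,y) / (p(x) p(y))) of an adjacent pair.

    Probabilities are plain relative frequencies (an assumption of this sketch).
    Pairs never seen in the corpus get -inf, so they always form a boundary.
    """
    if bigrams[(x, y)] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = bigrams[(x, y)] / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def segment(text, unigrams, bigrams, threshold=1.0):
    """Join adjacent characters whose mutual information exceeds the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    if not text:
        return []
    words = [text[0]]
    for x, y in zip(text, text[1:]):
        if mutual_information(x, y, unigrams, bigrams) > threshold:
            words[-1] += y      # strong association: keep in the same word
        else:
            words.append(y)     # weak association: start a new word
    return words


# Toy raw corpus; no lexicon or hand-segmented data is involved.
toy_corpus = ["中国中国", "人民人民", "中国人民"]
uni, bi = train_counts(toy_corpus)
print(segment("中国人民", uni, bi))
```

On this toy corpus the pairs 中国 and 人民 recur far more often than 国人, so the latter scores lower and becomes the boundary. A real corpus would need far more data and a more careful decision rule than one global threshold.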