Paper: Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data

ACL ID C98-2201
Title Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1998
Authors

Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without using any lexicon or hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. A preliminary experiment shows that the segmentation accuracy of our algorithm is acceptable. We hope the gains from this approach will help improve the performance of existing segmenters (especially their ability to cope with unknown words and to adapt to various domains), though the algorithm itself can also be utilized as a stand-alone segmenter in some NLP applicatio...
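As a rough illustration of how one of the statistics named in the abstract can be derived from raw text alone, the sketch below counts characters and adjacent character pairs, computes pointwise mutual information, and places a word boundary wherever the score falls to or below a threshold. It is a simplified assumption-laden example, not the paper's algorithm: it omits the difference of t-score entirely, and the function names, the toy corpus, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in raw text lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)                 # one count per character
        bigrams.update(zip(line, line[1:]))   # one count per adjacent pair
    return unigrams, bigrams


def mutual_information(x, y, unigrams, bigrams):
    """Pointwise mutual information log2(p(xy) / (p(x) p(y))) of an adjacent pair."""
    if bigrams[(x, y)] == 0:
        return float("-inf")                  # pair never observed in the corpus
    p_xy = bigrams[(x, y)] / sum(bigrams.values())
    p_x = unigrams[x] / sum(unigrams.values())
    p_y = unigrams[y] / sum(unigrams.values())
    return math.log2(p_xy / (p_x * p_y))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Join adjacent characters whose score exceeds the (hypothetical) threshold."""
    if not text:
        return []
    words = [text[0]]
    for x, y in zip(text, text[1:]):
        if mutual_information(x, y, unigrams, bigrams) > threshold:
            words[-1] += y                    # strong association: same word
        else:
            words.append(y)                   # weak association: word boundary
    return words


# Toy corpus, invented for illustration only.
unigrams, bigrams = train_counts(["中国", "中国", "人民", "人民"])
print(segment("中国人民", unigrams, bigrams))
```

On this toy corpus the pair 国人 is never observed, so its score is negative infinity and a boundary is placed between the two words. A real corpus would need smoothing and a tuned threshold, and the paper combines mutual information with a second statistic that this sketch does not attempt to reproduce.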