Paper: Exploiting Unlabeled Text to Extract New Words of Different Semantic Transparency for Chinese Word Segmentation

ACL ID I08-2134
Title Exploiting Unlabeled Text to Extract New Words of Different Semantic Transparency for Chinese Word Segmentation
Venue International Joint Conference on Natural Language Processing
Session Main Conference
Year 2008
Authors

This paper exploits unlabeled text data to improve new word identiflcation and Chinese word segmentation performance. Our contributions are twofold. First, for new words that lack semantic trans- parency, such as person, location, or transliteration names, we calculate as- sociation metrics of adjacent character segments on unlabeled data and encode this information as features. Second, we construct an internal dictionary by using an initial model to extract words from both the unlabeled training and test set to maintain balanced coverage on the training and test set. In comparison to the baseline model which only uses n-gram features, our approach increases new word recall up to 6.0%. Addition- ally, our approaches reduce segmenta- tion errors up to 32.3%. Our system achieves state-of-the...