Paper: Enhancing Chinese Word Segmentation Using Unlabeled Data

ACL ID D11-1090
Title Enhancing Chinese Word Segmentation Using Unlabeled Data
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2011

This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve system recall for out-of-vocabulary (OOV) words that appear more than once within a document. The novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words, respectively.
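
The abstract does not say which document-level statistics are used, so the following is only a minimal sketch of the general idea, assuming that plain counts of repeated character n-grams within one document serve as the signal. The function name `repeated_substrings` and the parameters `max_len` and `min_count` are hypothetical and do not come from the paper.

```python
from collections import Counter


def repeated_substrings(document, max_len=4, min_count=2):
    """Return character n-grams (length 2..max_len) that occur at least
    min_count times in the document, mapped to their counts.

    Assumption: raw repetition counts stand in for the document-level
    statistics; the paper's actual features may differ.
    """
    counts = Counter()
    for n in range(2, max_len + 1):
        # Slide a window of width n over the document.
        for i in range(len(document) - n + 1):
            counts[document[i:i + n]] += 1
    # Keep only strings repeated often enough to be candidate words.
    return {s: c for s, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    doc = "奥巴马访问北京。奥巴马表示"
    print(repeated_substrings(doc))
```

In this toy document the string 奥巴马 occurs twice, so it and its two bigrams are returned as repeated candidates. Counts like these could be attached to each character position as features for a character-based segmenter, which is one plausible way for a recurring OOV word to receive evidence from its own document.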