Paper: Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

ACL ID D13-1031
Title Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2013
Authors

Nowadays supervised sequence labeling models can reach competitive performance on the task of Chinese word segmenta- tion. However, the ability of these mod- els is restricted by the availability of an- notated data and the design of features. We propose a scalable semi-supervised fea- ture engineering approach. In contrast to previous works using pre-defined task- specific features with fixed values, we dy- namically extract representations of label distributions from both an in-domain cor- pus and an out-of-domain corpus. We update the representation values with a semi-supervised approach. Experiments on the benchmark datasets show that our approach achieve good results and reach an f-score of 0.961. The feature engineer- ing approach proposed here is a general iterative semi-supervised ...