Paper: A Multi-layer Chinese Word Segmentation System Optimized for Out-of-domain Tasks

ACL ID W10-4127
Title A Multi-layer Chinese Word Segmentation System Optimized for Out-of-domain Tasks
Venue Joint Conference on Chinese Language Processing
Session Main Conference
Year 2010
Authors

State-of-the-art Chinese word segmenta- tion systems have achieved high perfor- mance when training data and testing data are from the same domain. However, they suffer from the generalizability problem when applied on test data from different domains. We introduce a multi-layer Chi- nese word segmentation system which can integrate the outputs from multiple hetero- geneous segmentation systems. By train- ing a second layer of large margin clas- sifier on top of the outputs from several Conditional Random Fields classifiers, it can utilize a small amount of in-domain training data to improve the performance. Experimental results show consistent im- provement on F1 scores and OOV recall rates by applying the approach.