Paper: A Chinese Word Segmentation System Based on Structured Support Vector Machine Utilization of Unlabeled Text Corpus

ACL ID W10-4129
Title A Chinese Word Segmentation System Based on Structured Support Vector Machine Utilization of Unlabeled Text Corpus
Venue Joint Conference on Chinese Language Processing
Session Main Conference
Year 2010
Authors

We participated in the open and closed tracks on four corpora of the Chinese word segmentation task in the CIPS-SIGHAN-2010 Bake-off. In our experiments, we used Chinese inner phonology information in all tracks. For the open tracks, we proposed a double-hidden-layer HMM (DHHMM) in which Chinese inner phonology information serves as one hidden layer and the BIO tags as the other. N-best results were first generated with the DHHMM, and the best one was then selected using a new lexical statistic measure. For the closed tracks, we used a CRF model in which the Chinese inner phonology information was used as features.
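
The open-track pipeline (generate N-best taggings, then pick one with a lexical statistic) can be illustrated with a minimal sketch. The abstract does not define the lexical statistic measure or the DHHMM decoder, so everything here is an assumption made for illustration: the N-best tag sequences are taken as given input, the function names `bio_to_words`, `lexical_score`, and `rerank` are hypothetical, any tag other than "I" is assumed to start a new word, and the score is a simple stand-in (mean add-one-smoothed log relative frequency of the words), not the authors' measure.

```python
import math
from typing import Dict, List


def bio_to_words(chars: str, tags: List[str]) -> List[str]:
    """Group characters into words.

    Assumption: "I" continues the current word; any other tag starts a new one.
    """
    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag != "I" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def lexical_score(words: List[str], counts: Dict[str, int], total: int) -> float:
    """Hypothetical stand-in for the paper's lexical statistic measure:
    the mean add-one-smoothed log relative frequency of the words."""
    if not words:
        return float("-inf")
    denom = total + len(counts) + 1
    return sum(math.log((counts.get(w, 0) + 1) / denom) for w in words) / len(words)


def rerank(chars: str, nbest: List[List[str]], counts: Dict[str, int]) -> List[str]:
    """Return the candidate segmentation with the highest lexical score."""
    total = sum(counts.values())
    candidates = [bio_to_words(chars, tags) for tags in nbest]
    return max(candidates, key=lambda ws: lexical_score(ws, counts, total))


if __name__ == "__main__":
    # Toy word counts and two candidate taggings, both invented for illustration.
    counts = {"北京": 50, "大学生": 30, "北京大学": 20, "生": 5}
    nbest = [
        ["B", "I", "I", "I", "B"],  # 北京大学 / 生
        ["B", "I", "B", "I", "I"],  # 北京 / 大学生
    ]
    print(rerank("北京大学生", nbest, counts))
```

In a real system the candidate list would come from the N-best decoder and the counts from a large lexicon or unlabeled corpus; the sketch only shows how a second-stage lexical score can override the first-stage ranking.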