Paper: Chinese Word Segmentation with Conditional Support Vector Inspired Markov Models

ACL ID W10-4130
Title Chinese Word Segmentation with Conditional Support Vector Inspired Markov Models
Venue Joint Conference on Chinese Language Processing
Session Main Conference
Year 2010
Authors

Character-based tagging has achieved great success in Chinese Word Segmentation (CWS). This paper proposes a new approach that improves CWS tagging accuracy by exploiting an unlabeled text corpus within a structured support vector machine (SVM). First, character N-grams in the unlabeled corpus are mapped into a low-dimensional space using the self-organizing map (SOM) algorithm. Then new features extracted from these maps, together with an entropy-based feature for each N-gram, are integrated into the structured SVM for CWS. We took part in two tracks of the Word Segmentation for Simplified Chinese Text task in Bakeoff-2010: the Closed track and the Open track. The test corpora cover four domains: Literature, Computer Science, Medicine and Finance. Our system achieved good performance, esp...
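
As a rough illustration of two ideas named in the abstract, the sketch below shows character-based tagging and one possible entropy feature for character N-grams. It is not the authors' implementation: the B/M/E/S tag set, the choice of right-branching entropy (entropy of the character that follows an N-gram), and the function names `words_to_tags`, `tags_to_words`, and `right_branching_entropy` are all assumptions made for this example, since the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def words_to_tags(words):
    """Convert a segmented sentence into per-character tags.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # flush a trailing, unterminated word
        words.append(current)
    return words


def right_branching_entropy(text, n):
    """Entropy (in bits) of the character following each n-gram.

    This is one common entropy statistic computed from unlabeled text;
    it is an assumption here, not necessarily the paper's feature.
    """
    followers = defaultdict(Counter)
    for i in range(len(text) - n):
        followers[text[i:i + n]][text[i + n]] += 1
    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy
```

For instance, `words_to_tags(["我们", "是", "中国人"])` yields one tag per character, and `tags_to_words` inverts that mapping. A tagger would predict such tags for each character, and statistics like the entropy above could be discretized and added as extra features.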