Paper: Subword-Based Tagging For Confidence-Dependent Chinese Word Segmentation

ACL ID P06-2123
Title Subword-Based Tagging For Confidence-Dependent Chinese Word Segmentation
Venue Annual Meeting of the Association for Computational Linguistics
Session Poster Session
Year 2006
Authors
  • Ruiqiang Zhang (NTT Cyber Space Laboratories, Kanagawa Japan)
  • Genichiro Kikui (National Institute of Information and Communications Technology, Kyoto Japan; ATR Spoken Language Communication Research Laboratories, Kyoto Japan)
  • Eiichiro Sumita

We proposed subword-based tagging for Chinese word segmentation to improve on existing character-based tagging. The subword-based tagging was implemented using maximum entropy (MaxEnt) and conditional random fields (CRF) methods. We found that the proposed subword-based tagging outperformed character-based tagging in all comparative experiments. In addition, we proposed a confidence measure approach to combine the results of dictionary-based and subword-tagging-based segmentation. This approach can produce an ideal tradeoff between the in-vocabulary rate and the out-of-vocabulary rate. Our techniques were evaluated using the test data from the SIGHAN Bakeoff 2005. We achieved higher F-scores than the best results in three of the four corpora: PKU (0.951), CITYU (0.950) and ...
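To illustrate the difference between character-based and subword-based tagging units, here is a minimal sketch. It rests on several assumptions not stated in the abstract: a B/I/O tag set (B begins a multi-unit word, I continues it, O marks a unit that is a whole word by itself), a hypothetical subword lexicon, and forward maximum matching applied inside each gold word as a simplified way to produce training units. The function names `fmm_units` and `iob_tags` are illustrative only, and the sketch does not cover the MaxEnt/CRF models or the confidence measure.

```python
def fmm_units(text, lexicon, max_len=4):
    """Split text into units by forward maximum matching against a lexicon.

    Any position with no lexicon match falls back to a single character,
    so an empty lexicon yields purely character-based units.
    """
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                units.append(piece)
                i += n
                break
    return units


def iob_tags(words, lexicon=frozenset()):
    """Turn a gold segmentation (list of words) into (unit, tag) pairs.

    Assumed tag set: B = first unit of a multi-unit word, I = later unit,
    O = unit that forms a whole word by itself.
    """
    tagged = []
    for word in words:
        units = fmm_units(word, lexicon)
        if len(units) == 1:
            tagged.append((units[0], "O"))
        else:
            tagged.append((units[0], "B"))
            tagged.extend((u, "I") for u in units[1:])
    return tagged


gold = ["黄英春", "住", "在", "北京市"]
char_level = iob_tags(gold)                      # every unit is one character
subword_level = iob_tags(gold, lexicon={"北京"})  # "北京" is kept as one unit
```

With the empty lexicon, "北京市" is tagged as three character units (B, I, I); with "北京" in the lexicon, it becomes two units, "北京" (B) and "市" (I), so the tagger sees a shorter sequence with larger units.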