Paper: Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

ACL ID P13-2032
Title Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
Venue Annual Meeting of the Association for Computational Linguistics
Session Short Paper
Year 2013
Authors

Micro-blogs are a new kind of medium whose texts are short and informal. Because no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools do not perform as well on micro-blogs as on ordinary news text. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blogs. In our approach, we incorporate punctuation information from unlabeled micro-blog data by introducing the characters that appear immediately after or before punctuation marks, since they indicate the beginning or end of words. We also use a self-training framework to incorporate confident instances, which proves to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV recall.
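
The punctuation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the label strings, and the use of Unicode general category "P" as the definition of punctuation are all assumptions made for illustration. The sketch scans unlabeled text and collects characters directly after a punctuation mark as likely word beginnings and characters directly before one as likely word endings.

```python
import unicodedata


def is_punct(ch):
    # Assumption: "punctuation" means any character whose Unicode
    # general category starts with "P" (covers both ASCII and CJK marks).
    return unicodedata.category(ch).startswith("P")


def punctuation_instances(text):
    """Collect (index, char, label) triples from unlabeled text.

    A character right after a punctuation mark is labeled "word_begin";
    a character right before one is labeled "word_end". The label names
    are hypothetical and chosen only for this example.
    """
    instances = []
    for i, ch in enumerate(text):
        if is_punct(ch) or ch.isspace():
            continue  # only non-punctuation characters become instances
        if i > 0 and is_punct(text[i - 1]):
            instances.append((i, ch, "word_begin"))
        if i + 1 < len(text) and is_punct(text[i + 1]):
            instances.append((i, ch, "word_end"))
    return instances


sample = "今天天气好，我们去公园。"
result = punctuation_instances(sample)
for index, char, label in result:
    print(index, char, label)
```

Instances gathered this way could, under the same assumptions, be added to the training data of a character-based segmenter alongside confidently labeled examples from self-training; how the paper combines the two is not specified in the abstract.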