Paper: Automatic Corpus Expansion for Chinese Word Segmentation by Exploiting the Redundancy of Web Information

ACL ID C14-1109
Title Automatic Corpus Expansion for Chinese Word Segmentation by Exploiting the Redundancy of Web Information
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2014
Authors

Currently, most state-of-the-art methods for Chinese word segmentation (CWS) are based on supervised learning, which depends on large-scale annotated corpora. However, these supervised methods do not work well on a new domain that lacks sufficient annotated data. In this paper, we propose a method to automatically expand the training corpus for out-of-domain texts by exploiting the redundant information on the Web. We break up a complex and uncertain segmentation by resorting to the Web for an ample supply of relevant easy-to-segment sentences. We then pick out reliably segmented sentences and add them to the corpus. With the augmented corpus, we can re-train a better segmenter to resolve the original complex segmentation. The experimental results show that our appro...
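The expand-and-retrain loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon-based segmenter, the coverage-based confidence score, the 0.5 threshold, and the names `build_lexicon`, `segment`, and `expand_corpus` are all assumptions made for illustration, and "retraining" is reduced to rebuilding a lexicon from the augmented corpus. The web sentences are hard-coded stand-ins for retrieved text.

```python
from typing import List, Set, Tuple

Segmented = List[str]


def build_lexicon(corpus: List[Segmented]) -> Set[str]:
    """Toy 'training': collect every word seen in the segmented corpus."""
    return {word for sentence in corpus for word in sentence}


def segment(sentence: str, lexicon: Set[str], max_len: int = 4) -> Tuple[Segmented, float]:
    """Forward maximum matching (a stand-in segmenter, assumed for illustration).

    Consecutive characters that match no lexicon word are merged into one
    candidate word. Confidence is the share of characters covered by known words.
    """
    words: Segmented = []
    unknown = ""  # current run of unmatched characters
    covered = 0
    i = 0
    while i < len(sentence):
        match = ""
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + size] in lexicon:
                match = sentence[i:i + size]
                break
        if match:
            if unknown:  # flush the pending unknown run as one candidate word
                words.append(unknown)
                unknown = ""
            words.append(match)
            covered += len(match)
            i += len(match)
        else:
            unknown += sentence[i]
            i += 1
    if unknown:
        words.append(unknown)
    confidence = covered / len(sentence) if sentence else 0.0
    return words, confidence


def expand_corpus(corpus: List[Segmented], web_sentences: List[str],
                  threshold: float = 0.5) -> List[Segmented]:
    """Add web sentences whose segmentation confidence reaches the threshold."""
    lexicon = build_lexicon(corpus)
    expanded = list(corpus)
    for sentence in web_sentences:
        words, confidence = segment(sentence, lexicon)
        if confidence >= threshold:
            expanded.append(words)
    return expanded


# Hypothetical data: a small seed corpus and two "retrieved" web sentences.
seed = [["我", "喜欢", "看", "电影"], ["他", "在", "北京"]]
web = ["我喜欢微博", "他在刷微博"]
hard = "他刷微博看电影"

before, _ = segment(hard, build_lexicon(seed))
after, _ = segment(hard, build_lexicon(expand_corpus(seed, web)))
print(before)
print(after)
```

In this toy setting, the first web sentence places the unseen word in a context that the seed lexicon mostly covers, so it passes the filter and contributes that word to the rebuilt lexicon; the second sentence is rejected because too little of it is covered. A real system would replace the lexicon matcher with a trained statistical segmenter and its own confidence estimate.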