Paper: Unsupervized Word Segmentation: the Case for Mandarin Chinese

ACL ID P12-2075
Title Unsupervized Word Segmentation: the Case for Mandarin Chinese
Venue Annual Meeting of the Association for Computational Linguistics
Session Short Paper
Year 2012
Authors

In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on Jin and Tanaka-Ishii (2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation Bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).
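
As an illustration of the quantity the abstract builds on, the following is a minimal sketch of right branching entropy and its variation over character n-grams. It is an assumption-laden toy, not the system described in the paper: the function names `branching_entropy` and `entropy_variation` are hypothetical, only the right-hand (forward) context is counted, n-grams at the end of a sentence are ignored, and the normalization and Viterbi decoding steps mentioned above are not reproduced.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(corpus, max_n):
    """Right branching entropy of every character n-gram up to length max_n.

    For an n-gram x, h(x) = -sum_c P(c | x) * log2 P(c | x), where c ranges
    over the characters observed immediately after x in the corpus.
    Simplification (assumption): an n-gram that ends a sentence contributes
    no follower, so sentence boundaries are not modelled.
    """
    followers = defaultdict(Counter)
    for sentence in corpus:
        for i in range(len(sentence)):
            for n in range(1, max_n + 1):
                if i + n < len(sentence):
                    followers[sentence[i:i + n]][sentence[i + n]] += 1

    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def entropy_variation(entropy, gram):
    """Variation of branching entropy: h(x_0..n) - h(x_0..n-1).

    Requires len(gram) >= 2. A positive value means uncertainty about the
    next character rises after reading the last character of gram, which
    is the kind of signal Harris's hypothesis associates with a boundary.
    """
    if len(gram) < 2:
        raise ValueError("entropy variation needs an n-gram of length >= 2")
    return entropy[gram] - entropy[gram[:-1]]


if __name__ == "__main__":
    # Toy corpus of two "sentences" (Latin letters stand in for characters).
    toy_corpus = ["abab", "abac"]
    h = branching_entropy(toy_corpus, max_n=2)
    for gram in ("ab", "ba"):
        print(gram, entropy_variation(h, gram))
```

In this toy corpus, "ba" is followed by two different characters while "b" is always followed by "a", so the variation for "ba" is positive; "ab" is always followed by "a", so its variation is negative. A real segmenter would combine such scores in both reading directions and then search for the best segmentation, which this sketch leaves out.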