Paper: Improving A Simple Bigram HMM Part-of-Speech Tagger by Latent Annotation and Self-Training

ACL ID N09-2054
Title Improving A Simple Bigram HMM Part-of-Speech Tagger by Latent Annotation and Self-Training
Venue Human Language Technologies
Session Short Paper
Year 2009
Authors

In this paper, we describe and evaluate a bigram part-of-speech (POS) tagger that uses latent annotations and then investigate using additional genre-matched unlabeled data for self-training the tagger. The use of latent annotations substantially improves the performance of a baseline HMM bigram tagger, outperforming a trigram HMM tagger with sophisticated smoothing. The performance of the latent tagger is further enhanced by self-training with a large set of unlabeled data, even in situations where standard bigram or trigram taggers do not benefit from self-training when trained on greater amounts of labeled training data. Our best model obtains a state-of-the-art Chinese tagging accuracy of 94.78% when evaluated on a representative test set of the Penn Chinese Treebank 6.0.
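
For concreteness, one way to write the latent-annotated bigram HMM summarized above (a sketch of the standard formulation; the notation is illustrative, not taken from the paper): each treebank tag t_i is split into latent subtags t_i^{x_i}, and the joint probability of a word sequence and tag sequence marginalizes over the latent subtag sequence:

\[
P(\mathbf{w}, \mathbf{t}) \;=\; \sum_{x_1,\dots,x_n} \; \prod_{i=1}^{n} P\!\left(t_i^{x_i} \mid t_{i-1}^{x_{i-1}}\right) \, P\!\left(w_i \mid t_i^{x_i}\right)
\]

The subtag transition and emission parameters are typically estimated with EM on the labeled treebank; a plain bigram HMM is recovered as the special case with a single subtag per tag. Self-training then re-estimates these parameters after augmenting the labeled data with unlabeled sentences tagged automatically by the current model.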