Paper: Why Doesn't EM Find Good HMM POS-Taggers?

ACL ID D07-1031
Title Why Doesn't EM Find Good HMM POS-Taggers?
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2007
  • Mark Johnson (Microsoft Research, Redmond WA; Brown University, Providence RI)

This paper investigates why the HMMs estimated by Expectation-Maximization (EM) produce such poor results as Part-of-Speech (POS) taggers. We find that the HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed. This motivates a Bayesian approach using a sparse prior to bias the estimator toward such a skewed distribution. We investigate Gibbs Sampling (GS) and Variational Bayes (VB) estimators and show that VB converges faster than GS for this task and that VB significantly improves 1-to-1 tagging accuracy over EM. We also show that EM does nearly as well as VB when the number of hidden HMM states is dramatically reduced. We also point out the high variance in all of the...
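The contrast the abstract draws between EM's maximum-likelihood M-step and a VB M-step under a sparse Dirichlet prior can be sketched numerically. The snippet below is an illustrative example, not code from the paper: it compares a plain normalized-count update with the standard mean-field VB update for a multinomial with a symmetric Dirichlet(alpha) prior, where setting alpha < 1 suppresses low-count outcomes and so biases estimates toward a skewed distribution. The count values are invented for illustration.

```python
import numpy as np
from scipy.special import digamma

def em_update(counts):
    # EM M-step: plain maximum-likelihood normalization of expected counts.
    return counts / counts.sum()

def vb_update(counts, alpha):
    # Mean-field VB M-step with a symmetric Dirichlet(alpha) prior:
    # theta_i ∝ exp(digamma(n_i + alpha)) / exp(digamma(n + K*alpha)).
    # Because exp(digamma(x)) ≈ x - 0.5 for large x but shrinks rapidly
    # toward 0 for small x, alpha < 1 drives low-count outcomes toward
    # zero, yielding the sparse, skewed estimates the abstract describes.
    K = len(counts)
    return np.exp(digamma(counts + alpha)) / np.exp(digamma(counts.sum() + K * alpha))

# Toy expected counts: one dominant outcome, several rare ones.
counts = np.array([50.0, 3.0, 1.0, 0.5])
em = em_update(counts)
vb = vb_update(counts, alpha=0.1)
print(em)  # sums to 1
print(vb)  # rare outcomes get proportionally less mass than under EM
```

Note that the VB estimates are deliberately sub-normalized (they sum to less than 1); the missing mass is what penalizes parameters with little evidence.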