Paper: EM Decipherment for Large Vocabularies

ACL ID P14-2123
Title EM Decipherment for Large Vocabularies
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 2014

This paper addresses the problem of EM- based decipherment for large vocabular- ies. Here, decipherment is essentially a tagging problem: Every cipher token is tagged with some plaintext type. As with other tagging problems, this one can be treated as a Hidden Markov Model (HMM), only here, the vocabularies are large, so the usual O(NV 2 ) exact EM ap- proach is infeasible. When faced with this situation, many people turn to sam- pling. However, we propose to use a type of approximate EM and show that it works well. The basic idea is to collect fractional counts only over a small subset of links in the forward-backward lattice. The sub- set is different for each iteration of EM. One option is to use beam search to do the subsetting. The second method restricts the successor words that are ...