Paper: A PAC-Bayesian Approach to Minimum Perplexity Language Modeling

ACL ID C14-1014
Title A PAC-Bayesian Approach to Minimum Perplexity Language Modeling
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2014
Authors

Despite the overwhelming use of statistical language models in speech recognition, machine translation, and several other domains, few high-probability guarantees exist on their generalization error. In this paper, we bound the test set perplexity of two popular language models -- the n-gram model and class-based n-grams -- using PAC-Bayesian theorems for unsupervised learning. We extend the bound to sequence clustering, wherein classes represent longer context such as phrases. The new bound is dominated by the maximum number of sequences represented by each cluster, which is polynomial in the vocabulary size. We show that we can still encourage small-sample generalization by sparsifying the cluster assignment probabilities. We incorporate our bound into an efficient HMM-based sequence c...
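To make the quantity being bounded concrete, the following is a minimal sketch of test-set perplexity for a bigram model with add-alpha smoothing. This is illustrative only, not the paper's method: the corpus, the `train_bigram` and `perplexity` helpers, and the smoothing choice are all assumptions introduced here for exposition.

```python
import math
from collections import Counter

def train_bigram(tokens, vocab, alpha=1.0):
    """Return p(w | w_prev) with add-alpha smoothing (illustrative)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    V = len(vocab)

    def prob(w_prev, w):
        return (bigrams[(w_prev, w)] + alpha) / (contexts[w_prev] + alpha * V)

    return prob

def perplexity(prob, tokens):
    """PP = exp(-(1/N) * sum_i log p(w_i | w_{i-1})) over the test sequence."""
    log_sum = sum(math.log(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return math.exp(-log_sum / n)

# Toy usage: train on an alternating sequence, score a held-out one.
train = ["a", "b", "a", "b", "a", "b"]
test = ["a", "b", "a", "b"]
p = train_bigram(train, vocab={"a", "b"})
pp = perplexity(p, test)
```

The PAC-Bayesian bounds in the paper control how far this empirical test-set perplexity can deviate from its expectation, with high probability over the sample.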