Paper: An Empirical Investigation of Discounting in Cross-Domain Language Models

ACL ID P11-2005
Title An Empirical Investigation of Discounting in Cross-Domain Language Models
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011
Authors Greg Durrett, Dan Klein

We investigate the empirical behavior of n-gram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.
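
For concreteness, here is a minimal sketch of how such empirical discounts can be measured from two tokenized corpora: for each training count k, the discount d(k) is k minus the (size-adjusted) average count, in the second corpus, of the n-grams that occurred k times in training. The function name, the token-list interface, and the simple size rescaling are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def empirical_discounts(train_tokens, test_tokens, n=3, max_count=10):
    """Estimate d(k) = k - (avg size-adjusted test count of n-grams
    with training count k). Modified Kneser-Ney assumes d(k) is
    roughly constant for k >= 3; across domains the paper observes
    it growing roughly linearly with k."""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    train = ngram_counts(train_tokens)
    test = ngram_counts(test_tokens)
    # Rescale test counts so both corpora behave as if equally sized.
    scale = sum(train.values()) / max(sum(test.values()), 1)

    total_test_mass = Counter()  # summed scaled test counts, keyed by k
    num_types = Counter()        # distinct n-gram types with train count k
    for gram, k in train.items():
        if k <= max_count:
            total_test_mass[k] += scale * test.get(gram, 0)
            num_types[k] += 1
    return {k: k - total_test_mass[k] / num_types[k]
            for k in sorted(num_types)}

if __name__ == "__main__":
    # Toy corpora for illustration only.
    train = "the cat sat on the mat and the cat slept".split()
    test = "the dog sat on the mat and the dog ran".split()
    for k, d in empirical_discounts(train, test, n=2, max_count=3).items():
        print(f"train count {k}: empirical discount {d:.2f}")
```

Plotting d(k) against k for an in-domain versus a cross-domain test corpus is what separates the flat curve assumed by modified Kneser-Ney from the growing, roughly linear one reported here.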
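
To picture the adaptation, recall the interpolated Kneser-Ney recursion, where h' is the backed-off history and \gamma(h) normalizes the distribution. The linear form for D(k) below is one natural parameterization of a growing discount, given as an assumption rather than the paper's exact model:

\[
p(w \mid h) = \frac{\max\{c(hw) - D(c(hw)),\, 0\}}{c(h)} + \gamma(h)\, p(w \mid h'),
\qquad
D(k) =
\begin{cases}
D_k, & k \in \{1, 2\},\\
D_3 + s\,(k - 3), & k \ge 3.
\end{cases}
\]

Modified Kneser-Ney is the special case s = 0; fitting s > 0 on held-out data lets the discount grow linearly with the n-gram count, matching the cross-domain observation.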