Paper: Perplexity on Reduced Corpora

ACL ID P14-1075
Title Perplexity on Reduced Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014

This paper studies, from a theoretical standpoint, the common practice of removing low-frequency words from a corpus to reduce computational costs. Based on the assumption that a corpus follows Zipf's law, we derive trade-off formulae for the perplexity of k-gram models and topic models with respect to the size of the reduced vocabulary. In addition, we show the approximate behavior of each formula under certain conditions. We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on real corpora.
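The core setting can be illustrated with a small sketch: assume a unigram distribution following Zipf's law, truncate the vocabulary to the k most frequent types, and observe how perplexity changes. This is only an illustrative toy, not the paper's derivation; the function names, the unigram restriction, and the convention of mapping the removed tail to a single UNK token are all assumptions for the example.

```python
import math

def zipf_dist(V, s=1.0):
    # Zipf's law: probability of the word at rank r is proportional to 1/r^s,
    # over a vocabulary of V word types.
    weights = [1.0 / r**s for r in range(1, V + 1)]
    Z = sum(weights)
    return [w / Z for w in weights]

def unigram_perplexity(probs):
    # Perplexity of the true unigram model on its own distribution:
    # exp(entropy). Zero-probability entries contribute nothing.
    H = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(H)

def reduced_perplexity(probs, k):
    # Keep the k most frequent types and merge the low-frequency tail into
    # a single UNK token (one common reduction convention; the paper's exact
    # scheme may differ).
    head = probs[:k]
    unk = sum(probs[k:])
    return unigram_perplexity(head + [unk])

if __name__ == "__main__":
    probs = zipf_dist(10000)
    for k in (100, 1000, 10000):
        print(k, round(reduced_perplexity(probs, k), 2))
```

Running the sketch shows perplexity growing with the retained vocabulary size k, which is the trade-off curve the paper characterizes analytically for k-gram and topic models.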