Paper: Performance Prediction for Exponential Language Models

ACL ID N09-1051
Title Performance Prediction for Exponential Language Models
Venue Human Language Technologies
Session Main Conference
Year 2009

We investigate the task of performance pre- diction for language models belonging to the exponential family. First, we attempt to em- pirically discover a formula for predicting test set cross-entropy for n-gram language mod- els. We build models over varying domains, data set sizes, and n-gram orders, and perform linear regression to see whether we can model test set performance as a simple function of training set performance and various model statistics. Remarkably, we find a simple rela- tionship that predicts test set performance with a correlation of 0.9997. We analyze why this relationship holds and show that it holds for other exponential language models as well, in- cluding class-based models and minimum dis- crimination information models. Finally, we discuss how this relationshi...