Paper: Words and Echoes: Assessing and Mitigating the Non-Randomness Problem in Word Frequency Distribution Modeling

ACL ID P07-1114
Title Words and Echoes: Assessing and Mitigating the Non-Randomness Problem in Word Frequency Distribution Modeling
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2007
Authors

Frequency distribution models tuned to words and other linguistic events can predict the number of distinct types and their frequency distribution in samples of arbitrary sizes. We conduct, for the first time, a rigorous evaluation of these models based on cross-validation and separation of training and test data. Our experiments reveal that the prediction accuracy of the models is marred by serious overfitting problems, due to violations of the random sampling assumption in corpus data. We then propose a simple pre-processing method to alleviate such non-randomness problems. Further evaluation confirms the effectiveness of the method, which compares favourably to more complex correction techniques.
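
To make the random sampling assumption concrete, the following minimal sketch compares the number of distinct types observed in a contiguous stretch of text with the number expected if the same number of tokens were drawn at random from the whole corpus. It is an illustration only, not the models or the pre-processing method evaluated in the paper: the toy corpus and the function names `expected_types` and `observed_types` are assumptions made for this example, and the expectation uses the standard formula for sampling without replacement.

```python
from collections import Counter
from math import comb


def expected_types(tokens, n):
    """Expected number of distinct types in a random sample of n tokens
    drawn without replacement from `tokens`."""
    N = len(tokens)
    total = comb(N, n)
    freqs = Counter(tokens).values()
    # A type with corpus frequency f is missing from the sample with
    # probability C(N - f, n) / C(N, n); sum the complements over types.
    return sum(1 - comb(N - f, n) / total for f in freqs)


def observed_types(tokens, n):
    """Number of distinct types in the first n tokens, in corpus order."""
    return len(set(tokens[:n]))


# Hypothetical toy corpus: two "documents", each repeating its own topical word.
doc_a = ["cat", "cat", "cat", "the", "cat"]
doc_b = ["dog", "dog", "the", "dog", "dog"]
corpus = doc_a + doc_b

n = 5
print("observed in first", n, "tokens:", observed_types(corpus, n))
print("expected under random sampling:", round(expected_types(corpus, n), 3))
```

On this toy input the contiguous prefix contains fewer distinct types than the random-sampling expectation, because topical words cluster within a document. That clustering is the kind of violation of the random sampling assumption the abstract refers to.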