Paper: A Figure Of Merit For The Evaluation Of Web-Corpus Randomness

ACL ID E06-1028
Title A Figure Of Merit For The Evaluation Of Web-Corpus Randomness
Venue Annual Meeting of The European Chapter of The Association of Computational Linguistics
Session Main Conference
Year 2006
Authors

In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness of a collection of documents (corpus), with respect to a number of biased partitions. The method is based on the comparison of the word frequency distribution of the target corpus to word frequency distributions from corpora built in deliberately biased ways. We apply the method to the task of building a corpus via queries to Google. Our results indicate that this approach can be used, reliably, to discriminate biased and unbiased document collections and to choose the most appropriate query terms.
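
The core idea of comparing a target corpus's word frequency distribution against those of deliberately biased corpora can be illustrated with a minimal sketch. The abstract does not say which comparison measure the paper uses, so the Jensen-Shannon divergence here is an assumption chosen for illustration, as are the whitespace tokenization and the function names `word_freqs` and `js_divergence`.

```python
from collections import Counter
import math


def word_freqs(docs):
    """Count lowercase, whitespace-separated tokens over a list of document strings.
    (Tokenization is a simplifying assumption, not taken from the paper.)"""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (log base 2) between two word-count distributions.
    This measure is an illustrative assumption; the abstract names no specific one."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    div = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a  # Counter returns 0 for unseen words
        q = counts_b[word] / total_b
        m = (p + q) / 2
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div


# Hypothetical toy data: a target corpus and one deliberately biased corpus.
target = ["the cat sat on the mat", "markets closed early on friday"]
biased = ["stocks fell as markets closed", "bond markets rallied on friday"]
distance = js_divergence(word_freqs(target), word_freqs(biased))
```

With this measure, a value near 0 means the two distributions are nearly identical and a value near 1 means they share almost no vocabulary. Under these assumptions, a target corpus that sits much closer to one biased corpus than to the others would be a candidate for being biased in that direction.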