Paper: The Web As A Baseline: Evaluating The Performance Of Unsupervised Web-Based Models For A Range Of NLP Tasks

ACL ID N04-1016
Title The Web As A Baseline: Evaluating The Performance Of Unsupervised Web-Based Models For A Range Of NLP Tasks
Venue Human Language Technologies
Session Main Conference
Year 2004
Authors

Previous work demonstrated that web counts can be used to approximate bigram frequencies, and thus should be useful for a wide variety of NLP tasks. So far, only two generation tasks (candidate selection for machine translation and confusion-set disambiguation) have been tested using web-scale data sets. The present paper investigates if these results generalize to tasks covering both syntax and semantics, both generation and analysis, and a larger range of n-grams. For the majority of tasks, we find that simple, unsupervised models perform better when n-gram frequencies are obtained from the web rather than from a large corpus. However, in most cases, web-based models fail to outperform more sophisticated state-of-the-art models trained on small corpora. We argue that web-bas...
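To make the idea of a simple, unsupervised count-based model concrete, the following is a minimal sketch of confusion-set disambiguation driven by bigram frequencies. It is an illustration only, not the models evaluated in the paper: the scoring rule (sum of the left and right bigram counts), the function names `bigram_count` and `choose_candidate`, and the numbers in `TOY_COUNTS` are all assumptions introduced here. The table stands in for whatever source supplies the counts, such as web hit counts or corpus frequencies.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical bigram counts (invented numbers) standing in for web counts
# or corpus frequencies. Nothing here comes from the paper's data.
TOY_COUNTS: Dict[Tuple[str, str], int] = {
    ("piece", "of"): 900,
    ("peace", "of"): 120,
    ("world", "peace"): 300,
    ("world", "piece"): 2,
}


def bigram_count(w1: str, w2: str,
                 counts: Dict[Tuple[str, str], int] = TOY_COUNTS) -> int:
    """Return the count of the bigram (w1, w2), or 0 if it is unseen."""
    return counts.get((w1, w2), 0)


def choose_candidate(left: str, candidates: Iterable[str], right: str,
                     counts: Dict[Tuple[str, str], int] = TOY_COUNTS) -> str:
    """Pick the confusion-set member whose bigrams with its neighbours are
    most frequent. The score is an assumed heuristic:
    count(left, c) + count(c, right). Ties go to the earliest candidate."""
    def score(candidate: str) -> int:
        return (bigram_count(left, candidate, counts)
                + bigram_count(candidate, right, counts))
    return max(candidates, key=score)


if __name__ == "__main__":
    print(choose_candidate("a", ["piece", "peace"], "of"))
    print(choose_candidate("world", ["piece", "peace"], "today"))
```

Swapping the count source only requires passing a different `counts` mapping, which mirrors the comparison the abstract describes between frequencies obtained from the web and from a large corpus.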