Paper: Web Text Corpus For Natural Language Processing

ACL ID E06-1030
Title Web Text Corpus For Natural Language Processing
Venue Annual Meeting of The European Chapter of The Association of Computational Linguistics
Session Main Conference
Year 2006

Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.
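
To illustrate how counts from a locally stored corpus can stand in for search engine hit counts in context-sensitive spelling correction, here is a minimal sketch. It is not the paper's method: the trigram scoring rule, the function names `ngram_counts` and `choose`, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n=3):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def choose(confusion_set, left, right, counts):
    """Pick the candidate whose (left, candidate, right) trigram is most frequent.

    Hypothetical scoring rule: a real system would likely combine several
    context features rather than a single trigram count.
    """
    return max(confusion_set, key=lambda w: counts[(left, w, right)])

# Toy stand-in for a downloaded web corpus (assumed data, not from the paper).
corpus = "a piece of cake and a piece of pie brought peace of mind".split()
counts = ngram_counts(corpus)

# Disambiguate the confusion set {piece, peace} in the context "a ___ of".
best = choose(["piece", "peace"], "a", "of", counts)
print(best)
```

With a local corpus the counts are exact and reproducible, whereas search engine hit counts are estimates that can vary between queries.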