Paper: An Overview of Microsoft Web N-gram Corpus and Applications

ACL ID N10-2012
Title An Overview of Microsoft Web N-gram Corpus and Applications
Venue Human Language Technologies
Session System Demonstration
Year 2010
Authors

This document describes the properties and some applications of the Microsoft Web N-gram corpus. The corpus is designed to have the following characteristics. First, in contrast to the static data distributions of previous corpus releases, this N-gram corpus is made publicly available as an XML Web Service so that it can be updated as deemed necessary by the user community to include new words and phrases constantly being added to the Web. Second, the corpus makes available various sections of a Web document, specifically the body, title, and anchor text, as separate models, because the text contents of these sections are found to possess significantly different statistical properties and are therefore treated as distinct languages from the language modeling point of view. The usages of the corpus are...