Paper: Syntactic Annotations for the Google Books NGram Corpus

ACL ID P12-3029
Title Syntactic Annotations for the Google Books NGram Corpus
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2012
Authors

We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with statistical models that are specifically adapted to historical text. The corpus will facilitate the study of linguistic trends, especially those related to the evolution of syntax.
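
To make the two annotation types concrete, here is a minimal Python sketch of how a tagged word and a head-modifier pair might be represented and parsed. The abstract does not specify a surface format, so the `word_TAG` and `head=>modifier` conventions used here are assumptions, and the names `Token`, `Dependency`, `parse_token`, and `parse_dependency` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """A word with an optional part-of-speech tag."""
    word: str
    pos: Optional[str]  # None when the word carries no tag


@dataclass(frozen=True)
class Dependency:
    """A head-modifier relationship between two tokens."""
    head: Token
    modifier: Token


def parse_token(text: str) -> Token:
    # Assumed convention: "word_TAG", where TAG is an all-caps
    # alphabetic suffix after the last underscore.
    word, sep, tag = text.rpartition("_")
    if sep and word and tag.isalpha() and tag.isupper():
        return Token(word, tag)
    return Token(text, None)


def parse_dependency(text: str) -> Dependency:
    # Assumed convention: "head=>modifier".
    head, sep, modifier = text.partition("=>")
    if not sep or not head or not modifier:
        raise ValueError(f"not a head=>modifier pair: {text!r}")
    return Dependency(parse_token(head), parse_token(modifier))
```

For example, under these assumptions `parse_token("burnt_VERB")` yields a token with word `burnt` and tag `VERB`, while `parse_dependency("drinks=>often")` yields an untagged head `drinks` modified by `often`.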