Paper: An Efficient Indexer for Large N-Gram Corpora

ACL ID P11-4018
Title An Efficient Indexer for Large N-Gram Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2011

We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5-gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substi- ...
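
The contiguity restriction on wild card queries can be illustrated with a minimal sketch. The "*" wildcard token and the function name are assumptions made for illustration only; the abstract does not specify the tool's actual query syntax or how it validates queries.

```python
def wildcards_are_contiguous(tokens, wildcard="*"):
    """Return True if all wildcard tokens in the query form one unbroken run.

    `wildcard` is a hypothetical placeholder symbol, not the tool's syntax.
    A query with no wildcards is trivially accepted.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == wildcard]
    if not positions:
        return True
    # A contiguous run spans exactly as many slots as it has wildcards.
    return positions[-1] - positions[0] + 1 == len(positions)


if __name__ == "__main__":
    for query in ("the * * of", "the * cat * of", "the cat sat"):
        print(query, "->", wildcards_are_contiguous(query.split()))
```

Under these assumptions, a query such as "the * * of" would be accepted, while "the * cat * of" would be rejected because a fixed word separates its two wildcards.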