Paper: Indexing Google 1T for low-turnaround wildcarded frequency queries

ACL ID N12-2004
Title Indexing Google 1T for low-turnaround wildcarded frequency queries
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Student Session
Year 2012
Authors

We propose a technique to prepare the Google 1T n-gram data set for wildcarded frequency queries with a very low turnaround time, making unbatched applications possible. Our method supports token-level wildcarding and, given a cache of 3.3 GB of RAM, requires only a single read of less than 4 KB from the disk to answer a query. We present an indexing structure, a way to generate it, and suggestions for how it can be tuned to particular applications.
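
The abstract does not describe the indexing structure itself, so the following is only a rough sketch of the general idea of "a table in RAM plus one small disk read per query", not the paper's actual design. It assumes, hypothetically, that counts for every token-level wildcard pattern are precomputed, that records are stored sorted in fixed 4 KB blocks, and that the in-memory cache holds only the first key of each block. The names `wildcard_patterns`, `build_index`, `query`, and `BLOCK_SIZE` are invented for this illustration.

```python
import bisect
import io
import itertools
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size, chosen to match the "4 KB" read bound


def wildcard_patterns(ngram):
    """Yield every token-level wildcarding of an n-gram, e.g. 'the * sat'."""
    tokens = ngram.split()
    for mask in itertools.product((False, True), repeat=len(tokens)):
        yield " ".join("*" if m else t for t, m in zip(tokens, mask))


def build_index(ngram_counts, out):
    """Write sorted records into padded blocks; return the in-RAM key table."""
    # Assumption: wildcard counts are summed up front (a toy strategy).
    totals = Counter()
    for ngram, count in ngram_counts.items():
        for pattern in wildcard_patterns(ngram):
            totals[pattern] += count

    first_keys = []  # first pattern of each block; this is the RAM "cache"
    block = b""
    for pattern in sorted(totals):
        record = f"{pattern}\t{totals[pattern]}\n".encode("utf-8")
        if len(record) > BLOCK_SIZE:
            raise ValueError("record does not fit in one block")
        if block and len(block) + len(record) > BLOCK_SIZE:
            out.write(block.ljust(BLOCK_SIZE, b"\n"))  # pad to a full block
            block = b""
        if not block:
            first_keys.append(pattern)
        block += record
    if block:
        out.write(block.ljust(BLOCK_SIZE, b"\n"))
    return first_keys


def query(pattern, first_keys, f):
    """Answer one query with a binary search in RAM and a single block read."""
    i = bisect.bisect_right(first_keys, pattern) - 1
    if i < 0:
        return 0
    f.seek(i * BLOCK_SIZE)
    data = f.read(BLOCK_SIZE)  # the only disk access for this query
    wanted = pattern.encode("utf-8")
    for line in data.split(b"\n"):
        key, _, count = line.partition(b"\t")
        if key == wanted:
            return int(count)
    return 0


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for the on-disk index file
    keys = build_index({"the cat sat": 5, "the dog sat": 3, "a cat sat": 2}, buf)
    print(query("the * sat", keys, buf))  # 5 + 3 = 8
```

Precomputing every wildcard pattern multiplies the number of records by up to 2^n per n-gram, so this toy layout would not reach the memory and disk figures quoted in the abstract; it only shows why a sorted, block-aligned file with a sparse in-memory key table needs one bounded read per lookup.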