Paper: Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

ACL ID D10-1026
Title Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2010
Authors

We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in con- stant time, at speeds comparable to modern language modeling toolkits. Our basic ap- proach generates an explicit minimal perfect hash function, that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics ofn-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more effi- cient storage. We apply our approach to stor- ing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword...