Paper: Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap

ACL ID D07-1049
Title Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2007
Authors

A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF, here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our pro- posal takes advantage of the one-sided error guarantees of the BF and simple inequali- ties that hold between related n...