Paper: Randomised Language Modelling for Statistical Machine Translation

ACL ID P07-1065
Title Randomised Language Modelling for Statistical Machine Translation
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2007
Authors

A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements are significantly below lossless information-theoretic lower bounds, but it produces false positives with some quantifiable probability. Here we explore the use of BFs for language modelling in statistical machine translation. We show how a BF containing n-grams can enable us to use much larger corpora and higher-order models complementing a conventional n-gram LM within an SMT system. We also consider (i) how to include approximate frequency information efficiently within a BF and (ii) how to reduce the error rate of these models by first checking for lower-order sub-sequences in candidate n-grams. Our solutions in both cases retain the one-sided error guarantees of the BF while tak...
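The membership test and the sub-sequence check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The names `BloomFilter`, `add_ngrams` and `contains_with_subsequence_check`, the salted SHA-256 hashing, and the parameters `num_bits` and `num_hashes` are all assumptions made for illustration. The sketch also assumes that every lower-order sub-sequence of a stored n-gram is itself stored, which is what makes the early rejection safe. Frequency information is not modelled here.

```python
import hashlib
from typing import Iterator, Sequence


class BloomFilter:
    """Illustrative Bloom filter over strings (hypothetical, not the paper's code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: k hash functions are simulated by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct (one-sided error).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def add_ngrams(bf: BloomFilter, tokens: Sequence[str], max_order: int) -> None:
    """Insert every n-gram of order 1..max_order found in the token sequence."""
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            bf.add(" ".join(tokens[i:i + order]))


def contains_with_subsequence_check(bf: BloomFilter, ngram: Sequence[str]) -> bool:
    """Query lower-order sub-sequences first and reject as soon as one is absent."""
    n = len(ngram)
    for order in range(1, n):
        for i in range(n - order + 1):
            if " ".join(ngram[i:i + order]) not in bf:
                return False
    return " ".join(ngram) in bf


if __name__ == "__main__":
    bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
    add_ngrams(bf, "the cat sat on the mat".split(), max_order=3)
    print(contains_with_subsequence_check(bf, ["the", "cat", "sat"]))
    print(contains_with_subsequence_check(bf, ["the", "dog", "sat"]))
```

Because a Bloom filter never yields a false negative, a stored n-gram always passes both the sub-sequence checks and the final lookup. An unseen n-gram is accepted only if every one of its queries is a false positive, which is why checking sub-sequences first can lower the overall error rate under the stated assumption.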