Paper: Reduced N-Gram Models For English And Chinese Corpora

ACL ID P06-2040
Title Reduced N-Gram Models For English And Chinese Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session Poster Session
Year 2006
Authors

Statistical language models should improve as the size of the n-grams increases from 3 to 5 or higher. However, the number of parameters, the amount of computation, and the storage requirement all increase very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-gram approach previously developed by O’Boyle (1993) can be applied. A reduced n-gram language model can store an entire corpus’s phrase-history length within feasible storage limits. Another theoretical advantage of reduced n-grams is that they are closer to being semantically complete than traditional models, which include all n-grams. In our experiments, the reduced n-gram Zipf curves are first presented, and compared with previously obtained conventional n-grams for both English...
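As a rough illustration of two ideas the abstract relies on, the sketch below counts conventional n-grams in a toy token list, orders the counts into a Zipf-style rank-frequency curve, and computes the upper bound on distinct n-grams for a given vocabulary size. It does not implement O’Boyle’s reduced n-gram algorithm, which the abstract does not describe; the function names (`ngram_counts`, `zipf_curve`, `possible_ngrams`) and the toy sentence are assumptions made only for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def zipf_curve(counts):
    """Return (rank, frequency) pairs, most frequent n-gram first."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def possible_ngrams(vocab_size, n):
    """Upper bound on distinct n-grams over a vocabulary: vocab_size ** n."""
    return vocab_size ** n


# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
vocab_size = len(set(tokens))

bigrams = ngram_counts(tokens, 2)
curve = zipf_curve(bigrams)

# The bound grows exponentially with n, which is the storage problem
# the abstract refers to.
bound_2 = possible_ngrams(vocab_size, 2)
bound_5 = possible_ngrams(vocab_size, 5)
```

Plotting the rank and frequency values from `zipf_curve` on log-log axes gives the kind of curve usually called a Zipf curve; comparing `bound_2` with `bound_5` shows how quickly the space of possible n-grams outgrows what a corpus actually contains.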