Paper: Modeling of term-distance and term-occurrence information for improving n-gram language model performance

ACL ID P13-2042
Title Modeling of term-distance and term-occurrence information for improving n-gram language model performance
Venue Annual Meeting of the Association for Computational Linguistics
Session Short Paper
Year 2013
Authors

In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. We attempt to extract this information from history-contexts of up to ten words in size, and find that it is a good complement to the n-gram model, which inherently suffers from data scarcity in learning long history-contexts. Evaluated on the WSJ corpus, bigram and trigram model perplexities were reduced by up to 23.5% and 14.0%, respectively. Compared to the distant bigram, we show that word-pairs can be more effectively modeled in terms of both distance and occurrence.
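
To make the two kinds of word-pair information concrete, the following is a minimal sketch of how they might be counted from a token sequence. It is not the paper's model: the function name `pair_counts`, the parameter `max_distance`, and the choice to count each distinct history word once per target position are assumptions made for illustration. Only the ten-word history window comes from the abstract.

```python
from collections import Counter


def pair_counts(tokens, max_distance=10):
    """Collect word-pair statistics from a bounded history window.

    Hypothetical helper for illustration only. Returns two counters:
      distance_counts[(history_word, target, d)]: how often history_word
          appears exactly d positions before target (d = 1 is adjacent).
      occurrence_counts[(history_word, target)]: how often history_word
          appears anywhere in the window before target, counted once
          per target position regardless of distance.
    """
    distance_counts = Counter()
    occurrence_counts = Counter()
    for i, target in enumerate(tokens):
        # History window: at most max_distance words preceding the target.
        history = tokens[max(0, i - max_distance):i]
        for j, word in enumerate(history):
            distance = len(history) - j
            distance_counts[(word, target, distance)] += 1
        # Assumption: occurrence ignores position and repetition.
        for word in set(history):
            occurrence_counts[(word, target)] += 1
    return distance_counts, occurrence_counts


if __name__ == "__main__":
    sample = "the stock market fell as the stock price dropped".split()
    by_distance, by_occurrence = pair_counts(sample)
    print(by_distance[("stock", "price", 1)])
    print(by_occurrence[("the", "dropped")])
```

Keeping the two counters separate reflects the distinction drawn in the abstract: the distance counts record where in the history a word appeared, while the occurrence counts record only that it appeared. How these statistics are turned into probabilities and combined with the n-gram model is not specified here and is left out of the sketch.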