Paper: N-gram Weighting: Reducing Training Data Mismatch in Cross-Domain Language Model Estimation

ACL ID D08-1087
Title N-gram Weighting: Reducing Training Data Mismatch in Cross-Domain Language Model Estimation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2008
Authors Bo-June (Paul) Hsu, James Glass

In domains with insufficient matched training data, language models are often constructed by interpolating component models trained from partially matched corpora. Since the n-grams from such corpora may not be of equal relevance to the target domain, we propose an n-gram weighting technique to adjust the component n-gram probabilities based on features derived from readily available segmentation and metadata information for each corpus. Using a log-linear combination of such features, the resulting model achieves up to a 1.2% absolute word error rate reduction over a linearly interpolated baseline language model on a lecture transcription task.
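
To make the weighting scheme concrete, the sketch below shows one way a log-linear combination of features could scale component n-gram probabilities before interpolation. It is a minimal illustration under stated assumptions, not the paper's actual estimation procedure: the feature functions, component models, interpolation weights, and lambda values are all hypothetical placeholders, and the renormalization over a toy vocabulary stands in for proper model renormalization.

    import math

    def loglinear_weight(feature_vec, lambdas):
        """Log-linear feature score: beta(h, w) = exp(sum_j lambda_j * f_j(h, w))."""
        return math.exp(sum(lam * f for lam, f in zip(lambdas, feature_vec)))

    def ngram_weighted_prob(word, history, components, interp, lambdas, vocab):
        """p(w | h) proportional to sum_i interp_i * beta_i(h, w) * p_i(w | h),
        renormalized over the vocabulary so the result is a valid distribution."""
        def raw(w):
            return sum(
                interp[i]
                * loglinear_weight(c["features"](w, history), lambdas)
                * c["prob"](w, history)
                for i, c in enumerate(components)
            )
        z = sum(raw(w) for w in vocab)
        return raw(word) / z

    # Toy usage: two hypothetical component models over a three-word vocabulary.
    vocab = ["the", "lecture", "ends"]
    uniform = lambda w, h: 1.0 / len(vocab)                             # e.g. out-of-domain text
    skewed = lambda w, h: {"the": 0.6, "lecture": 0.3, "ends": 0.1}[w]  # e.g. lecture transcripts
    components = [
        {"prob": uniform, "features": lambda w, h: [1.0, 0.0]},  # indicator: written corpus
        {"prob": skewed,  "features": lambda w, h: [0.0, 1.0]},  # indicator: spoken corpus
    ]
    print(ngram_weighted_prob("lecture", ("the",), components, [0.5, 0.5], [0.2, 0.8], vocab))

In this toy setup the feature vectors are simple corpus-identity indicators, so the lambdas act as per-corpus boosts; the paper's features, by contrast, are derived from segmentation and metadata information, which the indicator functions here merely stand in for.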