Paper: Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation

ACL ID P08-1086
Title Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2008
Authors

In statistical language modeling, one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes. In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora. We introduce a modification of the exchange clustering algorithm with improved efficiency for certain partially class-based models and a distributed version of this algorithm to efficiently obtain automatic word classifications for large vocabularies (>1 million words) using such large training corpora (>30 billion tokens). The resulting clusterings are then used in training partially class-based language models. We show that combining them with word-based n-gram models in the log-linear model of a ...
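
To make the idea of partitioning the vocabulary into equivalence classes concrete, the following is a minimal sketch of a classic two-sided class-based bigram model, in which P(w | v) is approximated by P(c(w) | c(v)) * P(w | c(w)). It is an illustration only: it is not the partially class-based model or the exchange clustering algorithm described in the paper, and the function name `class_bigram_model`, the toy corpus, and the hand-assigned classes are all assumptions made for this example.

```python
from collections import Counter


def class_bigram_model(tokens, word_to_class):
    """Return a function prob(word, prev) for a two-sided class bigram model.

    Uses unsmoothed maximum-likelihood estimates:
        P(word | prev) ~= P(class(word) | class(prev)) * P(word | class(word))
    """
    classes = [word_to_class[w] for w in tokens]
    word_count = Counter(tokens)
    class_count = Counter(classes)
    # Class histories are all positions except the last one.
    history_count = Counter(classes[:-1])
    class_bigram_count = Counter(zip(classes[:-1], classes[1:]))

    def prob(word, prev):
        c_prev, c_word = word_to_class[prev], word_to_class[word]
        if history_count[c_prev] == 0:
            return 0.0
        p_class = class_bigram_count[(c_prev, c_word)] / history_count[c_prev]
        p_word = word_count[word] / class_count[c_word]
        return p_class * p_word

    return prob


# Toy corpus and a hand-made partition (hypothetical, for illustration only).
tokens = "the cat sat the dog sat the cat ran".split()
word_to_class = {
    "the": "DET",
    "cat": "NOUN", "dog": "NOUN",
    "sat": "VERB", "ran": "VERB",
}

p = class_bigram_model(tokens, word_to_class)

# The word bigram "dog ran" never occurs in the corpus, yet the class-based
# estimate is nonzero because NOUN -> VERB transitions were observed.
print(p("ran", "dog"))
```

The sketch shows why classes help with sparsity: statistics are shared among all words in a class, so an unseen word pair can still receive probability mass. In practice the classes would be learned automatically rather than assigned by hand, and the estimates would be smoothed.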