Paper: Multi-Class Composite N-Gram Language Model For Spoken Language Processing Using Multiple Word Clusters

ACL ID P01-1068
Title Multi-Class Composite N-Gram Language Model For Spoken Language Processing Using Multiple Word Clusters
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2001
Authors

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem in spoken language, for which training data is difficult to collect. The Multi-Class Composite N-gram maintains accurate word prediction and reliability on sparse data with a compact model size, based on multiple word clusters called Multi-Classes. In the Multi-Class, the statistical connectivity at each position of the N-gram is regarded as a word attribute, and one word cluster is created for each position to represent these positional attributes. Furthermore, by introducing higher-order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams. In experiments, the Multi-Class Composite ...
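
The idea of one word cluster per N-gram position can be illustrated with a minimal bigram sketch. It assumes the standard class-based factorization P(w_i | w_{i-1}) ≈ P(C_to(w_i) | C_from(w_{i-1})) · P(w_i | C_to(w_i)), which the abstract does not state explicitly. The function name `train_multiclass_bigram`, the maps `to_class` and `from_class`, and the toy corpus are hypothetical; the class maps are hand-written here rather than learned by the paper's clustering procedure, and no smoothing is applied.

```python
from collections import Counter

def train_multiclass_bigram(sentences, to_class, from_class):
    """Maximum-likelihood bigram with a separate word clustering per position.

    to_class:   word -> cluster used when the word is the one being predicted
    from_class: word -> cluster used when the word is the preceding context
    (Both maps are assumed to be given; they are not learned here.)
    """
    class_bigram = Counter()  # (from-class of previous word, to-class of current word)
    from_total = Counter()    # how often each from-class occurs as context
    word_count = Counter()    # how often each word occurs as the predicted word
    to_total = Counter()      # how often each to-class occurs as the predicted class

    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            class_bigram[(from_class[prev], to_class[cur])] += 1
            from_total[from_class[prev]] += 1
            word_count[cur] += 1
            to_total[to_class[cur]] += 1

    def prob(cur, prev):
        cf, ct = from_class[prev], to_class[cur]
        if from_total[cf] == 0 or to_total[ct] == 0:
            return 0.0  # unseen class: a real model would smooth here
        p_class = class_bigram[(cf, ct)] / from_total[cf]  # P(C_to | C_from)
        p_word = word_count[cur] / to_total[ct]            # P(w | C_to)
        return p_class * p_word

    return prob

# Toy data: the word pair ("the", "dog") never occurs in the corpus.
sentences = [["a", "dog", "runs"], ["the", "cat", "runs"], ["a", "cat", "sleeps"]]
from_class = {"a": "DET", "the": "DET", "dog": "NOUN", "cat": "NOUN",
              "runs": "VERB", "sleeps": "VERB"}
to_class = {"a": "det", "the": "det", "dog": "noun", "cat": "noun",
            "runs": "verb", "sleeps": "verb"}

prob = train_multiclass_bigram(sentences, to_class, from_class)
p_unseen = prob("dog", "the")  # nonzero although "the dog" was never observed
```

Because counts are shared across all words in a cluster, the unseen pair still receives probability mass, which is the sense in which clustering mitigates data sparseness. In this toy example the two maps happen to group words identically; keeping them as separate maps is what allows a word to be clustered differently as a context than as a prediction target.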