Paper: Unsupervised Morphology-Based Vocabulary Expansion

ACL ID P14-1127
Title Unsupervised Morphology-Based Vocabulary Expansion
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014

We present a novel way of generating unseen words, which is useful for certain applications such as automatic speech recognition or optical character recognition in low-resource languages. We test our vocabulary generator on seven low-resource languages by measuring the decrease in out-of-vocabulary word rate on a held-out test set. The languages we study have very different morphological properties; we show how our results differ depending on the morphological complexity of the language. In our best result (on Assamese), our approach can predict 29% of the token-based out-of-vocabulary words with a small amount of unlabeled training data.
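
The evaluation described in the abstract, the decrease in token-based out-of-vocabulary (OOV) rate on a held-out test set, can be illustrated with a minimal sketch. This is not the authors' code: the function names (oov_tokens, oov_rate, oov_recovered) and the toy English data are hypothetical, and the sketch only assumes that a token counts as OOV when its word form is absent from the vocabulary.

```python
from collections import Counter


def oov_tokens(test_tokens, vocabulary):
    """Count held-out tokens whose word form is missing from the vocabulary."""
    vocab = set(vocabulary)
    return Counter(tok for tok in test_tokens if tok not in vocab)


def oov_rate(test_tokens, vocabulary):
    """Token-based OOV rate: OOV tokens divided by all held-out tokens."""
    if not test_tokens:
        return 0.0
    return sum(oov_tokens(test_tokens, vocabulary).values()) / len(test_tokens)


def oov_recovered(test_tokens, train_vocab, generated_words):
    """Fraction of baseline OOV tokens that the generated words cover."""
    baseline = oov_tokens(test_tokens, train_vocab)
    total = sum(baseline.values())
    if total == 0:
        return 0.0
    generated = set(generated_words)
    covered = sum(n for word, n in baseline.items() if word in generated)
    return covered / total


# Toy data (hypothetical, for illustration only).
train_vocab = {"the", "cat", "walks"}
generated = {"walked", "cats", "walkes"}  # generators may also emit non-words
test = "the cat walked the cats walked dogs".split()

before = oov_rate(test, train_vocab)
after = oov_rate(test, train_vocab | generated)
recovered = oov_recovered(test, train_vocab, generated)
print(f"OOV rate before: {before:.3f}, after: {after:.3f}, recovered: {recovered:.2f}")
```

Under this reading, the 29% figure reported for Assamese would correspond to the recovered fraction: the share of OOV tokens in the held-out set that the generated vocabulary covers.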