Paper: Unsupervised Multiword Segmentation of Large Corpora using Prediction-Driven Decomposition of n-grams

ACL ID C14-1071
Title Unsupervised Multiword Segmentation of Large Corpora using Prediction-Driven Decomposition of n-grams
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2014
Authors

We present a new, efficient unsupervised approach to the segmentation of corpora into multiword units. Our method involves initial decomposition of common n-grams into segments which maximize within-segment predictability of words, and then further refinement of these segments into a multiword lexicon. Evaluating on four large, distinct corpora, we show that this method creates segments which correspond well to known multiword expressions; our model is particularly strong with regard to longer (3+ word) multiword units, which are often ignored or minimized in relevant work.
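
To make the idea of splitting an n-gram by word predictability concrete, the following is a minimal toy sketch and not the paper's algorithm. The abstract does not specify the predictability measure or the optimization, so this sketch assumes a plain bigram relative-frequency estimate of P(word | previous word) and swaps the maximization for a simple greedy rule: start a new segment wherever that estimate falls below a threshold. The function names, the toy corpus, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and adjacent bigrams in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def predictability(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) by relative frequency (0.0 if prev is unseen)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def segment(ngram, unigrams, bigrams, threshold=0.5):
    """Greedily split an n-gram where the next word is poorly predicted.

    Assumption: the threshold rule stands in for the optimization
    described in the abstract; it is not the authors' procedure.
    """
    if not ngram:
        return []
    segments = [[ngram[0]]]
    for prev, word in zip(ngram, ngram[1:]):
        if predictability(prev, word, unigrams, bigrams) >= threshold:
            segments[-1].append(word)   # well predicted: extend the segment
        else:
            segments.append([word])     # poorly predicted: start a new one
    return [tuple(s) for s in segments]


# Hypothetical toy corpus, chosen so that "york" always follows "new".
corpus = [
    "new york is big".split(),
    "new york has parks".split(),
    "new york was cold".split(),
    "it is far".split(),
    "he is old".split(),
]
unigrams, bigrams = count_ngrams(corpus)
result = segment(("new", "york", "is", "big"), unigrams, bigrams)
print(result)  # [('new', 'york'), ('is',), ('big',)]
```

In this toy corpus "york" follows "new" every time, so the pair stays together, while "is" follows "york" only one time in three and is split off. A real system would need far larger counts, and the refinement of segments into a multiword lexicon mentioned in the abstract is not attempted here.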