ACL ID | C14-1071 |
---|---|
Title | Unsupervised Multiword Segmentation of Large Corpora using Prediction-Driven Decomposition of n-grams |
Venue | International Conference on Computational Linguistics |
Session | Main Conference |
Year | 2014 |
Authors |
We present a new, efficient unsupervised approach to the segmentation of corpora into multiword units. Our method involves initial decomposition of common n-grams into segments which maximize within-segment predictability of words, and then further refinement of these segments into a multiword lexicon. Evaluating in four large, distinct corpora, we show that this method creates segments which correspond well to known multiword expressions; our model is particularly strong with regards to longer (3+ word) multiword units, which are often ignored or minimized in relevant work.
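
The decomposition step can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define the predictability measure, so the code below assumes a simple stand-in in which each word is predicted from the preceding words of its own segment, which by the chain rule reduces a segment's score to its relative frequency. The names `segmentations`, `segment_log_prob`, and `best_segmentation`, and all counts, are hypothetical.

```python
import math
from itertools import combinations


def segmentations(words):
    """Yield every split of `words` into contiguous, non-empty segments."""
    n = len(words)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tuple(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def segment_log_prob(segment, counts, total):
    """Assumed score: log of P(w1) * P(w2 | w1) * ... within one segment.

    The product telescopes to count(segment) / total. Unseen segments
    get -inf so they are never preferred.
    """
    c = counts.get(segment, 0)
    return math.log(c / total) if c > 0 else float("-inf")


def best_segmentation(words, counts, total):
    """Pick the split whose segments are jointly most predictable."""
    return max(
        segmentations(words),
        key=lambda segs: sum(segment_log_prob(s, counts, total) for s in segs),
    )


# Made-up counts for illustration only (total = corpus size in tokens).
counts = {
    ("in",): 1000, ("spite",): 20, ("of",): 2000, ("the",): 5000,
    ("in", "spite"): 18, ("spite", "of"): 18, ("of", "the"): 400,
    ("in", "spite", "of"): 18,
}
result = best_segmentation(("in", "spite", "of", "the"), counts, total=100_000)
print(result)  # [('in', 'spite', 'of'), ('the',)]
```

With these counts, "spite" almost always occurs inside "in spite of", so keeping the three words together costs far less than splitting them, while "the" is better left on its own. Exhaustive enumeration is exponential in the n-gram length and is only reasonable here because n-grams are short; the refinement into a multiword lexicon mentioned in the abstract is not covered by this sketch.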