Paper: Clustering Words With The MDL Principle

ACL ID C96-1003
Title Clustering Words With The MDL Principle
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1996

We address the problem of automatically constructing a thesaurus by clustering words based on corpus data. We view this problem as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose a learning algorithm based on the Minimum Description Length (MDL) Principle for such estimation. We empirically compared the performance of our method based on the MDL Principle against the Maximum Likelihood Estimator in word clustering, and found that the former outperforms the latter. We also evaluated the method by conducting pp-attachment disambiguation experiments using an automatically constructed thesaurus. Our experimental results indicate that such a thesaurus can be used to improve accura...
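
To make the contrast between MDL and maximum likelihood concrete, the following is a minimal sketch of a two-part description length for a hard clustering of noun-verb pairs. It is not taken from the paper: the function name `description_length`, the uniform spreading of each cluster-pair probability over its member word pairs, and the (k/2) log2 n parameter cost are all assumptions chosen for illustration, and the toy data is invented.

```python
import math
from collections import Counter


def description_length(pairs, noun_cluster, verb_cluster):
    """Return (data_dl, model_dl) in bits for a hard clustering.

    Assumed model (illustrative, not necessarily the paper's):
    P(n, v) = P(C_n, C_v) / (|C_n| * |C_v|), with P(C_n, C_v)
    estimated by relative frequency.
    """
    n = len(pairs)
    # Co-occurrence counts per (noun cluster, verb cluster) cell.
    cells = Counter((noun_cluster[noun], verb_cluster[verb])
                    for noun, verb in pairs)
    noun_sizes = Counter(noun_cluster.values())
    verb_sizes = Counter(verb_cluster.values())

    # Data description length: negative log-likelihood of the sample.
    data_dl = 0.0
    for (cn, cv), count in cells.items():
        p_pair = (count / n) / (noun_sizes[cn] * verb_sizes[cv])
        data_dl -= count * math.log2(p_pair)

    # Model description length: (k / 2) * log2(n), where k is the
    # number of free parameters of the joint cluster distribution.
    k = len(noun_sizes) * len(verb_sizes) - 1
    model_dl = 0.5 * k * math.log2(n)
    return data_dl, model_dl


# Hypothetical toy sample: 8 noun-verb co-occurrences.
pairs = ([("wine", "drink"), ("beer", "drink"),
          ("bread", "eat"), ("rice", "eat")] * 2)
verbs_split = {"drink": 0, "eat": 1}

# Candidate partitions of the nouns (and verbs).
grouped = {"wine": 0, "beer": 0, "bread": 1, "rice": 1}
singletons = {"wine": 0, "beer": 1, "bread": 2, "rice": 3}
one_noun_class = {"wine": 0, "beer": 0, "bread": 0, "rice": 0}
one_verb_class = {"drink": 0, "eat": 0}

for name, nc, vc in [("grouped", grouped, verbs_split),
                     ("singletons", singletons, verbs_split),
                     ("single class", one_noun_class, one_verb_class)]:
    data_dl, model_dl = description_length(pairs, nc, vc)
    print(f"{name}: data={data_dl:.1f} model={model_dl:.1f} "
          f"total={data_dl + model_dl:.1f}")
```

On this toy sample the grouped and singleton partitions fit the data equally well (16 bits each), so likelihood alone cannot choose between them. Adding the parameter cost gives totals of 20.5, 26.5, and 24.0 bits for the grouped, singleton, and single-class partitions respectively, so the two-part criterion selects the grouped partition.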