Paper: Unsupervised Segmentation Of Words Using Prior Distributions Of Morph Length And Frequency

ACL ID P03-1036
Title Unsupervised Segmentation Of Words Using Prior Distributions Of Morph Length And Frequency
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2003
Authors
  • Mathias Creutz (Helsinki University of Technology, Helsinki, Finland)

We present a language-independent and unsupervised algorithm for the segmentation of words into morphs. The algorithm is based on a new generative probabilistic model, which makes use of relevant prior information on the length and frequency distributions of morphs in a language. Our algorithm is shown to outperform two competing algorithms when evaluated on data from a language with agglutinative morphology (Finnish), and to perform well also on English data.
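
To make the idea of segmenting words with a probabilistic morph model more concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes a fixed, already-known morph lexicon with counts, scores a segmentation by the sum of negative log relative frequencies, and finds the cheapest split with a Viterbi-style search. The gamma form of the length prior, the function names `gamma_log_pdf` and `segment`, and the toy counts are all assumptions introduced for illustration; the abstract states only that priors on morph length and frequency are used.

```python
import math


def gamma_log_pdf(length, shape, scale):
    """Log density of a gamma distribution at a morph length.

    Assumption: a gamma-shaped length prior is used here purely as an
    example of "prior information on morph length".
    """
    return ((shape - 1) * math.log(length) - length / scale
            - math.lgamma(shape) - shape * math.log(scale))


def segment(word, morph_counts):
    """Return the lowest-cost split of `word` into morphs from `morph_counts`.

    Cost of a morph is -log(count / total); the search is a standard
    Viterbi-style dynamic program over split positions.
    """
    total = sum(morph_counts.values())
    # best[i] = (cost of best segmentation of word[:i], start of last morph)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_counts and best[start][0] < math.inf:
                cost = best[start][0] - math.log(morph_counts[piece] / total)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[-1][0] == math.inf:
        return [word]  # no split into known morphs; keep the word whole
    # Follow the back-pointers from the end of the word to recover the morphs.
    morphs, end = [], len(word)
    while end > 0:
        start = best[end][1]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1]


# Hypothetical toy lexicon; the counts are invented for illustration only.
counts = {"talo": 5, "ssa": 4, "kin": 3, "auto": 2}
```

With this toy lexicon, `segment("talossakin", counts)` splits the Finnish word into `talo`, `ssa`, and `kin`, while a word with no covering morphs is returned unsplit. In a full unsupervised setting the lexicon itself would have to be learned, with the length and frequency priors entering the model's overall probability; that learning step is outside this sketch.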