Paper: Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis

ACL ID P08-1083
Title Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2008
Authors

Morphological disambiguation proceeds in 2 stages: (1) an analyzer provides all possible analyses for a given token and (2) a stochastic disambiguation module picks the most likely analysis in context. When the analyzer does not recognize a given token, we hit the problem of unknowns. In large scale corpora, unknowns appear at a rate of 5 to 10% (depending on the genre and the maturity of the lexicon). We address the task of computing the distribution p(t|w) for unknown words for full morphological disambiguation in Hebrew. We introduce a novel algorithm that is language independent: it exploits a maximum entropy letters model trained over the known words observed in the corpus and the distribution of the unknown words in known tag contexts, through iterative approximat...
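The two-stage setup described in the abstract, and the point at which unknown words enter it, can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the lexicon, the tag set, the tag prior, and all function names are hypothetical, the disambiguation step ignores context, and the fallback for unknown words is a naive tag prior standing in for the distribution p(t|w) that an unknown-word model would supply.

```python
# Hypothetical toy lexicon: token -> possible analyses (tags). Not from the paper.
LEXICON = {
    "the": ["DET"],
    "book": ["NOUN", "VERB"],
    "reads": ["VERB", "NOUN"],
}

# Hypothetical context-free tag prior (an assumption for illustration only).
TAG_PRIOR = {"DET": 0.3, "NOUN": 0.4, "VERB": 0.3}


def analyze(token):
    """Stage 1: return all analyses for a token, or None if it is unknown."""
    return LEXICON.get(token)


def p_tag_given_word(token):
    """Return a toy distribution p(t|w) over candidate tags.

    Known words: the prior renormalized over the analyzer's candidates.
    Unknown words: the prior over the full tag set, a naive fallback that
    a dedicated unknown-word model would replace.
    """
    candidates = analyze(token) or list(TAG_PRIOR)
    total = sum(TAG_PRIOR[t] for t in candidates)
    return {t: TAG_PRIOR[t] / total for t in candidates}


def disambiguate(tokens):
    """Stage 2 (simplified): pick the most likely tag per token, ignoring context."""
    return [max(p_tag_given_word(tok).items(), key=lambda kv: kv[1])[0]
            for tok in tokens]


def unknown_rate(tokens):
    """Fraction of tokens the analyzer does not recognize."""
    return sum(analyze(tok) is None for tok in tokens) / len(tokens)


if __name__ == "__main__":
    sentence = ["the", "book", "reads", "blorf"]
    print(unknown_rate(sentence))
    print(disambiguate(sentence))
```

In this sketch the quality of the result for an unknown token depends entirely on the fallback distribution, which is why a better estimate of p(t|w) for unknown words matters to the pipeline as a whole.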