Paper: Modelling the Lexicon in Unsupervised Part of Speech Induction

ACL ID E14-1013
Title Modelling the Lexicon in Unsupervised Part of Speech Induction
Venue Annual Meeting of The European Chapter of The Association of Computational Linguistics
Session Main Conference
Year 2014
Authors

Automatically inducing the syntactic part-of-speech categories for words in text is a fundamental task in Computational Linguistics. While the performance of unsupervised tagging models has been slowly improving, current state-of-the-art systems make the obviously incorrect assumption that all tokens of a given word type must share a single part-of-speech tag. This one-tag-per-type heuristic counters the tendency of Hidden Markov Model based taggers to overgenerate tags for a given word type. However, it is clearly incompatible with basic syntactic theory. In this paper we extend a state-of-the-art Pitman-Yor Hidden Markov Model tagger with an explicit model of the lexicon. In doing so we are able to incorporate a soft bias towards inducing few tags per type. We develop a partic...
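
To make the one-tag-per-type heuristic concrete, the following minimal sketch counts the tags observed for each word type in a small hand-made tagged corpus and computes the best token accuracy any tagger could reach if it were forced to assign a single tag per type. It is an illustration only and not the model described in the paper: the toy corpus, the tag names, the lowercasing of word forms, and the function names tags_per_type and one_tag_per_type_upper_bound are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def tags_per_type(tagged_tokens):
    """Map each word type to a Counter of the tags its tokens carry."""
    lexicon = defaultdict(Counter)
    for word, tag in tagged_tokens:
        # Lowercasing is an assumption of this sketch, not of the paper.
        lexicon[word.lower()][tag] += 1
    return lexicon


def one_tag_per_type_upper_bound(tagged_tokens):
    """Best token accuracy achievable when every type gets exactly one tag."""
    lexicon = tags_per_type(tagged_tokens)
    total = sum(sum(counts.values()) for counts in lexicon.values())
    # The best single tag for a type is its most frequent tag.
    correct = sum(max(counts.values()) for counts in lexicon.values())
    return correct / total


# Hypothetical toy corpus: "run" occurs both as a noun and as a verb.
toy_corpus = [
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("they", "PRON"), ("run", "VERB"), ("daily", "ADV"),
    ("we", "PRON"), ("run", "VERB"), ("the", "DET"), ("race", "NOUN"),
]

lexicon = tags_per_type(toy_corpus)
# Word types whose tokens carry more than one tag.
ambiguous = {word: dict(counts) for word, counts in lexicon.items() if len(counts) > 1}
upper_bound = one_tag_per_type_upper_bound(toy_corpus)

print(ambiguous)
print(round(upper_bound, 3))
```

In this toy corpus only "run" is ambiguous, so a hard one-tag-per-type constraint necessarily mislabels its minority reading. A soft bias towards few tags per type, as the abstract describes, would instead leave room for such minority tags.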