Paper: Morphological Richness Offsets Resource Demand - Experiences In Constructing A POS Tagger For Hindi

ACL ID P06-2100
Title Morphological Richness Offsets Resource Demand - Experiences In Constructing A POS Tagger For Hindi
Venue Annual Meeting of the Association for Computational Linguistics
Session Poster Session
Year 2006
Authors

In this paper we report our work on building a POS tagger for a morphologically rich language, Hindi. The theme of the research is to vindicate the stand that if morphology is strong and harnessable, then lack of training corpora is not debilitating. We establish a methodology of POS tagging which resource-disadvantaged languages (those lacking annotated corpora) can make use of. The methodology makes use of locally annotated, modestly sized corpora (15,562 words), exhaustive morphological analysis backed by a high-coverage lexicon, and a decision tree based learning algorithm (CN2). The system was evaluated with 4-fold cross validation of the corpora in the news domain (www.bbc.co.uk/hindi). The current accuracy of POS tagging is 93.45% and can be further impr...
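The 4-fold cross validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the learner here is a most-frequent-tag baseline standing in for CN2, there is no morphological analyser or lexicon, and the function names (`k_fold_splits`, `train_baseline`, `cross_validate`), the toy sentences, and the tag labels are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def k_fold_splits(sentences, k=4):
    """Yield (train, test) pairs; fold i tests on every k-th sentence from i."""
    for i in range(k):
        test = [s for j, s in enumerate(sentences) if j % k == i]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def train_baseline(train):
    """Most-frequent-tag baseline (a stand-in, NOT the paper's CN2 learner)."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in train:
        for word, tag in sentence:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    # Unknown words fall back to the most frequent tag in the training folds.
    default_tag = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lambda word: lexicon.get(word, default_tag)


def cross_validate(sentences, k=4):
    """Return the per-fold token accuracies and their mean."""
    accuracies = []
    for train, test in k_fold_splits(sentences, k):
        tagger = train_baseline(train)
        total = sum(len(s) for s in test)
        correct = sum(tagger(w) == t for s in test for w, t in s)
        accuracies.append(correct / total)
    return accuracies, sum(accuracies) / len(accuracies)


# Invented toy corpus of (word, tag) pairs; romanised words and tag names
# are hypothetical and do not come from the paper's data or tag set.
toy_corpus = [
    [("ladka", "NOUN"), ("ghar", "NOUN"), ("gaya", "VERB")],
    [("kitab", "NOUN"), ("acchi", "ADJ"), ("hai", "VERB")],
    [("ladka", "NOUN"), ("kitab", "NOUN"), ("padhta", "VERB"), ("hai", "VERB")],
    [("ghar", "NOUN"), ("accha", "ADJ"), ("hai", "VERB")],
    [("ladki", "NOUN"), ("ghar", "NOUN"), ("gayi", "VERB")],
    [("kitab", "NOUN"), ("ghar", "NOUN"), ("hai", "VERB")],
    [("ladki", "NOUN"), ("kitab", "NOUN"), ("padhti", "VERB"), ("hai", "VERB")],
    [("ladka", "NOUN"), ("accha", "ADJ"), ("hai", "VERB")],
]

fold_accuracies, mean_accuracy = cross_validate(toy_corpus, k=4)
print(fold_accuracies, mean_accuracy)
```

Each sentence is held out exactly once, and the reported figure is the mean of the four per-fold token accuracies. Splitting by sentence rather than by token keeps the context of each test sentence out of the training folds.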