Paper: What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages

ACL ID D14-1096
Title What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014
Authors

In this paper we address the problem of multilingual part-of-speech tagging for resource-poor languages. We use parallel data to transfer part-of-speech information from resource-rich to resource-poor languages. Additionally, we use a small amount of annotated data to learn to "correct" errors from the projection approach, such as tagset mismatch between languages, achieving state-of-the-art performance (91.3%) across 8 languages. Our approach has modest data requirements and uses minimum divergence classification. For situations where no universal tagset mapping is available, we propose an alternate method, resulting in state-of-the-art 85.6% accuracy on the resource-poor language Malagasy.
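
To make the idea of transferring tags through parallel data concrete, the following is a minimal sketch of tag projection over word alignments. It is not the authors' implementation. The function names (`project_tags`, `build_tag_dictionary`), the toy English-French sentence pairs, and the majority-vote aggregation are all assumptions chosen for illustration, and the alignments are assumed to come from an external word aligner.

```python
from collections import Counter, defaultdict


def project_tags(source_tags, alignments):
    """Copy each source token's tag to the target token it is aligned with.

    alignments is a list of (source_index, target_index) pairs.
    If a target token has several alignments, the first one wins
    (a simplifying assumption for this sketch).
    """
    projected = {}
    for src_idx, tgt_idx in alignments:
        projected.setdefault(tgt_idx, source_tags[src_idx])
    return projected


def build_tag_dictionary(sentence_pairs):
    """Aggregate projected tags per target word type by majority vote."""
    counts = defaultdict(Counter)
    for source_tags, target_tokens, alignments in sentence_pairs:
        for tgt_idx, tag in project_tags(source_tags, alignments).items():
            counts[target_tokens[tgt_idx]][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Hypothetical toy data: tagged English side, untagged French side.
pairs = [
    # "the dog barks" -> "le chien aboie"
    (["DET", "NOUN", "VERB"], ["le", "chien", "aboie"], [(0, 0), (1, 1), (2, 2)]),
    # "the cat" -> "le chat"
    (["DET", "NOUN"], ["le", "chat"], [(0, 0), (1, 1)]),
]

tag_dictionary = build_tag_dictionary(pairs)
print(tag_dictionary)
```

Projected tags of this kind are noisy, which is where the small annotated sample described in the abstract comes in: it provides the evidence needed to correct systematic projection errors such as tagset mismatches.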