Paper: Extracting and Classifying Urdu Multiword Expressions

ACL ID P11-3005
Title Extracting and Classifying Urdu Multiword Expressions
Venue Annual Meeting of the Association for Computational Linguistics
Session Student Session
Year 2011
Authors

This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.
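The postposition-based heuristic and the f-score evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which postpositions signal which class, so the cue sets below (transliterated Urdu postpositions) and their assignment to classes are assumptions, as are the function names `classify` and `f_score` and the tie-handling rule.

```python
from collections import Counter

# ASSUMPTION: illustrative cue sets, not taken from the paper.
LOCATION_CUES = {"meN", "se"}  # locative 'in', ablative 'from'
PERSON_CUES = {"ne", "ko"}     # ergative marker, dative/accusative marker


def classify(following_postpositions):
    """Label an MWE candidate from the postpositions observed right after it.

    Returns "location", "person", or None when the evidence is absent or tied.
    """
    counts = Counter(following_postpositions)
    location_votes = sum(counts[p] for p in LOCATION_CUES)
    person_votes = sum(counts[p] for p in PERSON_CUES)
    if location_votes == person_votes:
        return None  # no decision without a clear majority (assumed rule)
    return "location" if location_votes > person_votes else "person"


def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) of two sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A candidate seen with `["ne", "ne", "meN"]` would be labelled a person under these assumed cues. The f-score is computed per class by comparing the set of MWEs assigned to that class against the corresponding gold-standard set.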