Paper: Urdu Word Segmentation

ACL ID N10-1077
Title Urdu Word Segmentation
Venue Human Language Technologies
Session Main Conference
Year 2010

Word segmentation is the first obligatory task in almost all NLP applications, since their initial phase requires tokenizing the input into words. Urdu is among the Asian languages that face a word segmentation challenge. However, unlike in other Asian languages, word segmentation in Urdu involves not only space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features of Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs n-gram ranking on top of a rule-based maximum matching heuristic. Our best technique gives an error detection of 85.8% and an overall accuracy of 95.8%. Further issues and possible future directions are also disc...
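
The hybrid idea named in the abstract (n-gram ranking on top of a maximum matching heuristic) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the rules, the n-gram order, or the smoothing, so the sketch assumes a fewest-words variant of maximum matching, an add-one smoothed bigram model, and a toy Latin-script lexicon with made-up counts chosen only for readability. All function and variable names are hypothetical, and the sketch covers only the space omission case.

```python
from math import log

# Toy data (assumed for illustration only; not from the paper).
LEXICON = {"no", "where", "now", "here"}
UNIGRAMS = {"<s>": 10, "now": 6, "here": 5, "no": 4, "where": 2}
BIGRAMS = {("<s>", "now"): 3, ("now", "here"): 5, ("<s>", "no"): 1}


def candidate_segmentations(text, lexicon, max_len=None):
    """Return every split of `text` into lexicon words (exhaustive; short inputs only)."""
    if not text:
        return [[]]
    max_len = max_len or max(map(len, lexicon))
    results = []
    for end in range(1, min(len(text), max_len) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in candidate_segmentations(text[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


def maximum_matching(text, lexicon):
    """Keep only the candidates with the fewest words (assumed heuristic variant)."""
    candidates = candidate_segmentations(text, lexicon)
    if not candidates:
        return []
    fewest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == fewest]


def bigram_score(words, unigrams, bigrams):
    """Log-probability of a word sequence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    score, prev = 0.0, "<s>"
    for word in words:
        numerator = bigrams.get((prev, word), 0) + 1
        denominator = unigrams.get(prev, 0) + vocab_size
        score += log(numerator / denominator)
        prev = word
    return score


def segment(text, lexicon, unigrams, bigrams):
    """Rank the maximum-matching candidates by bigram score; None if no split exists."""
    candidates = maximum_matching(text, lexicon)
    if not candidates:
        return None
    return max(candidates, key=lambda c: bigram_score(c, unigrams, bigrams))


if __name__ == "__main__":
    print(segment("nowhere", LEXICON, UNIGRAMS, BIGRAMS))
```

With this toy data, the unspaced string "nowhere" has two equally short candidates, "no where" and "now here". The heuristic alone cannot choose between them, so the bigram scores break the tie. Space insertion errors, which the abstract singles out as specific to Urdu, would need a separate merging step that this sketch does not attempt.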