Paper: Maximum Entropy Based Restoration Of Arabic Diacritics

ACL ID P06-1073
Title Maximum Entropy Based Restoration Of Arabic Diacritics
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2006

Short vowels and other diacritics are not part of written Arabic scripts. Exceptions are made for important political and religious texts and in scripts for beginning students of Arabic. Script without diacritics has considerable ambiguity because many words with different diacritic patterns appear identical in a diacritic-less setting. In this paper we propose a maximum entropy approach for restoring diacritics in a document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based and part-of-speech tag features. The combination of these feature types leads to a state-of-the-art diacritization model. Using a publicly available corpus (LDC’s Arabic Treebank Part 3), we ach...
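
To make the modeling idea concrete, the following is a minimal sketch of how a maximum entropy (log-linear) classifier could score candidate diacritics for a single character. It is an illustration under stated assumptions, not the authors' implementation: the label set, the Latin transliteration of the input, the feature templates, and the weights are all hypothetical, and segment-based features are left out for brevity.

```python
import math

# Hypothetical label set for one character: three short vowels or no diacritic.
LABELS = ["a", "i", "u", ""]

def features(chars, pos_tag, i, label):
    """Binary feature functions f_j(x, y), each paired with a candidate label."""
    prev_char = chars[i - 1] if i > 0 else "<s>"
    return [
        ("char", chars[i], label),   # lexical: the current character
        ("prev", prev_char, label),  # lexical: the previous character
        ("pos", pos_tag, label),     # part-of-speech tag of the containing word
    ]

def predict(weights, chars, pos_tag, i):
    """Return P(label | context) = exp(sum_j w_j * f_j) / Z for every label."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in features(chars, pos_tag, i, y))
        for y in LABELS
    }
    top = max(scores.values())  # subtract the max score for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())
    return {y: math.exp(s - top) / z for y, s in scores.items()}

# Made-up weights; in practice they would be estimated from a diacritized corpus.
weights = {
    ("char", "k", "a"): 1.2,
    ("pos", "VERB", "a"): 0.8,
    ("prev", "<s>", "i"): 0.3,
}

probs = predict(weights, ["k", "t", "b"], "VERB", 0)
best = max(probs, key=probs.get)
```

Because every information source enters the model as just another feature function, a new feature type can be added by extending `features` without changing the prediction code, which is the kind of easy integration the abstract describes.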