Paper: Arabizi Detection and Conversion to Arabic

ACL ID: W14-3629
Venue: Workshop on Arabic Natural Language Processing
Year: 2014

Arabizi is Arabic text written in Latin characters. It is used to render both Modern Standard Arabic (MSA) and Arabic dialects. It is commonly used in informal settings such as social networking sites and is often mixed with English. In this paper we address two problems: identifying Arabizi in text and converting it to Arabic characters. We used word-level and sequence-level features to identify Arabizi that is mixed with English, achieving an identification accuracy of 98.5%. For conversion, we used transliteration mining with language modeling to generate equivalent Arabic text. We achieved 88.7% conversion accuracy, with roughly a third of the errors being spelling and morphological variants of the forms in the ground truth.
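To make the conversion step more concrete, the following is a minimal sketch of the general idea of generating candidate Arabic spellings from character mappings and ranking them with a language model. It is not the system described in the paper. The table `CHAR_MAP`, the counts in `WORD_COUNTS`, and the function names `candidates`, `score`, and `convert` are all hypothetical. In a real system the mappings would presumably be learned through transliteration mining and the language model would be trained on a large Arabic corpus; here both are tiny hand-written stand-ins.

```python
import math
from itertools import product

# Hypothetical toy mapping from Latin characters to possible Arabic letters.
# An empty string means the Latin character (often a short vowel) is dropped.
# These entries are illustrative assumptions, not mappings from the paper.
CHAR_MAP = {
    "7": ["ح"],
    "a": ["ا", ""],
    "b": ["ب"],
    "k": ["ك"],
    "m": ["م"],
    "r": ["ر"],
    "t": ["ت", "ط"],
}

# Hypothetical unigram counts standing in for an Arabic language model.
WORD_COUNTS = {"كتاب": 50, "كتب": 20, "مرحبا": 30}


def candidates(word):
    """Expand every character into its options and join all combinations."""
    options = [CHAR_MAP.get(ch, [ch]) for ch in word.lower()]
    return {"".join(parts) for parts in product(*options)}


def score(candidate):
    """Add-one smoothed unigram log-probability of a candidate word."""
    total = sum(WORD_COUNTS.values())
    vocab = len(WORD_COUNTS) + 1  # one extra slot for unseen words
    return math.log((WORD_COUNTS.get(candidate, 0) + 1) / (total + vocab))


def convert(word):
    """Return the candidate the toy language model scores highest."""
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(candidates(word)), key=score)


if __name__ == "__main__":
    for arabizi in ["ktab", "mar7aba"]:
        print(arabizi, "->", convert(arabizi))
```

With this toy data, `convert("ktab")` selects "كتاب" over the shorter variant "كتب" only because the stand-in counts favor it. Scoring whole words in isolation is a simplification; a sequence-level language model over neighboring words would be needed to resolve ambiguity that depends on context.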