Paper: Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus

ACL ID W14-3612
Title Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus
Venue Workshop on Arabic Natural Language Processing
Session
Year 2014
Authors

This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons. This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic phonologically, morphologically and lexically, and most importantly, they lack standard orthographies. Second, ...