Paper: Verifiably Effective Arabic Dialect Identification

ACL ID D14-1154
Title Verifiably Effective Arabic Dialect Identification
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014
Authors

Several recent papers on Arabic dialect identification have hinted that using a word unigram model is sufficient and effective for the task. However, most previous work was done on a standard, fairly homogeneous dataset of dialectal user comments. In this paper, we show that training on the standard dataset does not generalize, because a unigram model may be tuned to topics in the comments and does not capture the distinguishing features of dialects. We show that effective dialect identification requires that we account for the distinguishing lexical, morphological, and phonological phenomena of dialects. We show that accounting for such phenomena can improve dialect detection accuracy by nearly 10% absolute.
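
To make the baseline discussed in the abstract concrete, the following is a minimal sketch of a word unigram dialect classifier. It assumes a multinomial Naive Bayes model with add-one smoothing, which the abstract does not specify; the names `train`, `predict`, and `TOY_DATA`, the label set, and the handful of sentences are all hypothetical and are not taken from the paper's dataset.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy examples (not from the paper's data): a few sentences
# per variety, labeled Egyptian (EGY), Levantine (LEV), or Standard (MSA).
TOY_DATA = [
    ("انا عايز اروح البيت", "EGY"),
    ("ده مش كويس", "EGY"),
    ("انت عامل ايه", "EGY"),
    ("انا بدي اروح عالبيت", "LEV"),
    ("شو هاد", "LEV"),
    ("كيفك شو الاخبار", "LEV"),
    ("انا اريد ان اذهب الى البيت", "MSA"),
    ("ماذا تفعل", "MSA"),
    ("هذا ليس جيدا", "MSA"),
]


def train(examples):
    """Count whitespace-separated word unigrams per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed unigram log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # class prior
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TOY_DATA)
```

Calling `predict(model, "شو بدي")` scores the input against each label's unigram counts. Because such a model sees only surface words, any topical word that happens to co-occur with one label in training is weighted in the same way as a genuine dialect marker, which is the generalization concern the abstract raises.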