Paper: The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content

ACL ID P11-2007
Title The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011
Authors

The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which were marked as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.