Paper: Using Twitter to Collect a Multi-Dialectal Corpus of Arabic

ACL ID W14-3601
Title Using Twitter to Collect a Multi-Dialectal Corpus of Arabic
Venue Workshop on Arabic Natural Language Processing
Session
Year 2014
Authors

This paper describes the collection and classification of a multi-dialectal corpus of Arabic based on the geographical information of tweets. We mapped user location information to one of the Arab countries and extracted tweets that contain one or more dialectal words. Manual evaluation of the extracted corpus shows that the accuracy of assigning tweets to some countries (such as Saudi Arabia and Egypt) is above 93%, while the accuracy for other countries, such as Algeria and Syria, is below 70%.
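
The two steps named in the abstract (mapping a user's location to a country, then keeping only tweets with a dialectal word) can be illustrated with a minimal sketch. This is not the authors' implementation: the lookup table `LOCATION_TO_COUNTRY`, the word list `DIALECTAL_WORDS`, the tweet dictionary keys, and all function names are hypothetical and chosen only for illustration, and substring matching stands in for whatever mapping procedure the paper actually used.

```python
from typing import Dict, List, Optional

# Hypothetical lookup table (assumption): lowercase location substrings -> country.
LOCATION_TO_COUNTRY: Dict[str, str] = {
    "riyadh": "Saudi Arabia",
    "jeddah": "Saudi Arabia",
    "cairo": "Egypt",
    "alexandria": "Egypt",
    "algiers": "Algeria",
    "damascus": "Syria",
}

# Hypothetical, illustrative dialectal word list (assumption, not from the paper).
DIALECTAL_WORDS = {"مش", "ايش", "شلون"}


def map_location(user_location: str) -> Optional[str]:
    """Return the country whose key appears in the free-text location, if any."""
    loc = user_location.lower()
    for key, country in LOCATION_TO_COUNTRY.items():
        if key in loc:
            return country
    return None


def has_dialectal_word(text: str) -> bool:
    """True if any whitespace-separated token is in the dialectal word list."""
    return any(token in DIALECTAL_WORDS for token in text.split())


def build_corpus(tweets: List[dict]) -> Dict[str, List[str]]:
    """Group tweet texts by mapped country, keeping only dialectal tweets."""
    corpus: Dict[str, List[str]] = {}
    for tweet in tweets:
        country = map_location(tweet.get("user_location", ""))
        if country and has_dialectal_word(tweet.get("text", "")):
            corpus.setdefault(country, []).append(tweet["text"])
    return corpus


sample_tweets = [
    {"user_location": "Cairo, Egypt", "text": "انا مش فاهم"},
    {"user_location": "Riyadh", "text": "hello world"},
    {"user_location": "Mars", "text": "مش"},
]
sample_corpus = build_corpus(sample_tweets)
```

In this sketch only the first sample tweet is kept: the second has a mappable location but no dialectal word, and the third has a dialectal word but no mappable location. A tweet is assigned to a country purely from the user's self-reported location, so any mismatch between that field and the dialect actually written would show up as an assignment error in a manual evaluation.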