Paper: The International Corpus of Arabic: Compilation, Analysis and Evaluation

ACL ID W14-3602
Title The International Corpus of Arabic: Compilation, Analysis and Evaluation
Venue Workshop on Arabic Natural Language Processing
Session
Year 2014
Authors

This paper focuses on a project for building the first International Corpus of Arabic (ICA). It is planned to contain 100 million analyzed tokens, with an interface that allows users to interact with the corpus data in a number of ways [ICA website]. ICA is a representative corpus of Arabic that was initiated in 2006; it is intended to cover Modern Standard Arabic (MSA) as used all over the Arab world. ICA has been analyzed by the Bibliotheca Alexandrina Morphological Analysis Enhancer (BAMAE), which is based on the Buckwalter Arabic Morphological Analyzer (BAMA). Precision and recall are the measures used to evaluate the BAMAE system. At this point, precision ranges from 92% to 95%, while recall ranges from 89% to 92%. This depends o...
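
For readers unfamiliar with the two evaluation measures named in the abstract, the following is a minimal sketch of the standard definitions: precision is the share of the system's analyses that are correct, and recall is the share of the reference (gold) analyses that the system recovered. The abstract does not say how the paper counts a correct analysis, so the function name `precision_recall`, the `(token_id, analysis)` pair representation, and the toy tags below are all assumptions made for illustration, not the authors' evaluation procedure or data.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (token_id, analysis) pairs.

    Hypothetical helper: the pair representation is an assumption, not the
    format used by BAMAE or the ICA project.
    """
    predicted, gold = set(predicted), set(gold)
    # An analysis counts as correct only if the same pair appears in the gold set.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy, made-up data: four gold analyses, three system analyses, two of them correct.
gold = {(0, "NOUN"), (1, "VERB"), (2, "PREP"), (3, "NOUN")}
predicted = {(0, "NOUN"), (1, "VERB"), (2, "NOUN")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three system analyses are correct and two of the four gold analyses are recovered, so precision is 2/3 and recall is 1/2.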