Paper: A Human Judgement Corpus and a Metric for Arabic MT Evaluation

ACL ID D14-1026
Title A Human Judgement Corpus and a Metric for Arabic MT Evaluation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014
Authors

We present a human judgment dataset and an adapted metric for the evaluation of Arabic machine translation. Our medium-scale dataset is the first of its kind for Arabic, with high annotation quality. We use the dataset to adapt the BLEU score to Arabic. Our score (AL-BLEU) provides partial credit for stem and morphological matches between hypothesis and reference words. We evaluate BLEU, METEOR, and AL-BLEU on our human judgment corpus and show that AL-BLEU has the highest correlation with human judgments. We are releasing the dataset and software to the research community.
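
As a rough illustration of the partial-credit idea behind AL-BLEU, here is a minimal Python sketch of a BLEU-style soft unigram precision in which a hypothesis token earns full credit for an exact match and reduced credit for a stem or morphological-feature match. The `stem` and `morph_features` stubs and the weights `w_exact`, `w_stem`, and `w_morph` are hypothetical placeholders, not the paper's analyzer or tuned values.

```python
# Sketch of partial-credit token matching in the spirit of AL-BLEU.
# Stubs and weights below are illustrative assumptions, not the
# paper's Arabic morphological analyzer or optimized parameters.

def stem(token):
    """Hypothetical stemmer stub (a real system would use an Arabic analyzer)."""
    return token.rstrip("ةات")  # toy heuristic, for illustration only

def morph_features(token):
    """Hypothetical morphological-feature stub (e.g., a final marker)."""
    return {token[-1]} if token else set()

def partial_match(hyp_tok, ref_tok, w_exact=1.0, w_stem=0.8, w_morph=0.6):
    """Return a partial-credit score in [0, 1] for one token pair."""
    if hyp_tok == ref_tok:
        return w_exact
    if stem(hyp_tok) == stem(ref_tok):
        return w_stem
    if morph_features(hyp_tok) & morph_features(ref_tok):
        return w_morph
    return 0.0

def soft_unigram_precision(hyp, ref):
    """BLEU-style unigram precision where each hypothesis token takes
    its best partial credit against a not-yet-used reference token."""
    ref_used = [False] * len(ref)
    total = 0.0
    for h in hyp:
        best, best_j = 0.0, -1
        for j, r in enumerate(ref):
            if not ref_used[j]:
                score = partial_match(h, r)
                if score > best:
                    best, best_j = score, j
        if best_j >= 0:
            ref_used[best_j] = True
        total += best
    return total / len(hyp) if hyp else 0.0
```

Under this scheme, a hypothesis word sharing only a stem with its reference counterpart still contributes (here 0.8 instead of 0.0), which is the behavior the abstract describes for morphologically rich Arabic.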