Paper: Arabic Preprocessing Schemes For Statistical Machine Translation

ACL ID N06-2013
Title Arabic Preprocessing Schemes For Statistical Machine Translation
Venue Human Language Technologies
Session Short Paper
Year 2006
Authors

In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.