Paper: Reversing Morphological Tokenization in English-to-Arabic SMT

ACL ID N13-2007
Title Reversing Morphological Tokenization in English-to-Arabic SMT
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Student Session
Year 2013

Morphological tokenization has been used in machine translation for morphologically complex languages to reduce lexical sparsity. Unfortunately, when translating into a morphologically complex language, recombining segmented tokens to generate original word forms is not a trivial task, due to morphological, phonological and orthographic adjustments that occur during tokenization. We review a number of detokenization schemes for Arabic, such as rule-based and table-based approaches, and show their limitations. We then propose a novel detokenization scheme that uses a character-level discriminative string transducer to predict the original form of a segmented word. In a comparison to a state-of-the-art approach, we demonstrate slightly better detokenization error rates, without the...
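
To make the rule-based and table-based baselines mentioned in the abstract concrete, here is a minimal sketch in Python. It is not the paper's implementation, and it does not implement the proposed string transducer. It assumes Buckwalter transliteration and a segmentation convention in which prefixes carry a trailing "+" and suffixes a leading "+". The function names, the three rules, and the one-entry table are illustrative assumptions.

```python
import re

# Hypothetical lookup table mapping a segmented form to its original form.
# A real table would be collected from a tokenized training corpus.
TABLE = {
    "ElY +hm": "Elyhm",  # exception: the Y -> A rule below would give "ElAhm"
}


def rule_detokenize(segmented: str) -> str:
    """Rejoin clitics using a few illustrative orthographic rules."""
    # Turn "prefix+ stem +suffix" into "prefix|stem|suffix".
    word = re.sub(r"\+ ", "|", segmented)
    word = re.sub(r" \+", "|", word)
    # Preposition l+ followed by the definite article Al+ contracts to "ll".
    word = re.sub(r"(^|\|)l\|Al\|", r"\1ll", word)
    # Ta marbuta (p) becomes t before a suffix.
    word = word.replace("p|", "t")
    # Alif maqsura (Y) becomes A before a suffix.
    word = word.replace("Y|", "A")
    # Any remaining boundary is plain concatenation.
    return word.replace("|", "")


def detokenize(segmented: str) -> str:
    """Table-based lookup first, with the rules as a fallback."""
    return TABLE.get(segmented) or rule_detokenize(segmented)


if __name__ == "__main__":
    for seg in ["l+ Al+ ktAb", "mdrsp +hA", "mstwY +h", "ElY +hm"]:
        print(seg, "->", detokenize(seg))
```

The last example illustrates the kind of limitation the abstract refers to: a single rule cannot cover both "mstwY +h" and "ElY +hm", so the table has to list the exception, and a table only helps for forms it has already seen.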