Paper: Corpus Effects on the Evaluation of Automated Transliteration Systems

ACL ID P07-1081
Title Corpus Effects on the Evaluation of Automated Transliteration Systems
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2007
Authors

Most current machine transliteration systems employ a corpus of known source-target word pairs to train their system, and typically evaluate their systems on a similar corpus. In this paper we explore the performance of transliteration systems on corpora that are varied in a controlled way. In particular, we control the number, and prior language knowledge of human transliterators used to construct the corpora, and the origin of the source words that make up the corpora. We find that the word accuracy of automated transliteration systems can vary by up to 30% (in absolute terms) depending on the corpus on which they are run. We conclude that at least four human transliterators should be used to construct corpora for evaluating automated transliteration systems; and that al...
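
As an illustration of how the same system output can score differently on different corpora, the following is a minimal sketch of a word accuracy computation. It assumes (this definition is not given in the abstract) that a system's top-ranked output counts as correct when it matches any variant supplied by the human transliterators for that source word. The function name `word_accuracy`, the placeholder strings, and both toy corpora are hypothetical and are not data from the paper.

```python
from typing import Dict, Set


def word_accuracy(predictions: Dict[str, str],
                  references: Dict[str, Set[str]]) -> float:
    """Fraction of source words whose predicted transliteration matches
    any accepted reference variant (assumed definition, see lead-in)."""
    if not references:
        raise ValueError("references must not be empty")
    correct = sum(
        1 for source, variants in references.items()
        if predictions.get(source) in variants
    )
    return correct / len(references)


# Hypothetical system output: one transliteration per source word.
predictions = {"w1": "t1a", "w2": "t2b", "w3": "t3a", "w4": "t4c"}

# Hypothetical corpus built by a single transliterator: one variant per word.
corpus_single = {"w1": {"t1a"}, "w2": {"t2a"}, "w3": {"t3a"}, "w4": {"t4a"}}

# Hypothetical corpus built by several transliterators: more accepted variants.
corpus_multi = {
    "w1": {"t1a", "t1b"},
    "w2": {"t2a", "t2b"},
    "w3": {"t3a"},
    "w4": {"t4a", "t4b"},
}

acc_single = word_accuracy(predictions, corpus_single)
acc_multi = word_accuracy(predictions, corpus_multi)

# Absolute difference, expressed in percentage points.
abs_diff_points = abs(acc_multi - acc_single) * 100
print(acc_single, acc_multi, abs_diff_points)
```

In this toy setup the system output is identical in both runs; only the set of accepted variants changes, which is the kind of corpus effect on measured accuracy that the abstract describes.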