Paper: Constructing Transliteration Lexicons From Web Corpora

ACL ID: P04-3003
Title: Constructing Transliteration Lexicons From Web Corpora
Venue: Annual Meeting of the Association for Computational Linguistics
Session: System Demonstration
Year: 2004
Authors:
  • Jin-Shea Kuo (Chunghwa Telecom Co., Ltd., Chungli, Taiwan; National Taiwan University of Science and Technology, Taiwan)
  • Ying-Kuei Yang (National Taiwan University of Science and Technology, Taiwan)

This paper proposes a novel approach to automating the construction of transliterated-term lexicons. A simple syllable alignment algorithm is used to construct confusion matrices for cross-language syllable-phoneme conversion. Each row in the confusion matrix consists of a set of syllables in the source language that are (correctly or erroneously) matched phonetically and statistically to a syllable in the target language. Two conversions using phoneme-to-phoneme and text-to-phoneme syllabification algorithms are automatically deduced from a training corpus of paired terms and are used to calculate the degree of similarity between phonemes for transliterated-term extraction. In a large-scale experiment using this automated learning process for conversions, more than 200,000 transliterated-...
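
The confusion-matrix idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the paired terms are already syllabified and aligned one-to-one, it replaces the paper's alignment and phoneme-similarity steps with simple co-occurrence counting, and the function names (build_confusion_matrix, similarity) and the toy syllable data are hypothetical, invented only for illustration.

```python
from collections import Counter, defaultdict

def build_confusion_matrix(aligned_pairs):
    """Build one row per target syllable.

    Each row maps a source syllable to the relative frequency with which
    it was aligned to that target syllable in the training pairs.
    Assumption: each pair is already syllabified and aligned one-to-one.
    """
    counts = defaultdict(Counter)
    for source_syllables, target_syllables in aligned_pairs:
        for src, tgt in zip(source_syllables, target_syllables):
            counts[tgt][src] += 1
    matrix = {}
    for tgt, row in counts.items():
        total = sum(row.values())
        matrix[tgt] = {src: n / total for src, n in row.items()}
    return matrix

def similarity(matrix, source_syllables, target_syllables):
    """Average the row frequencies over aligned positions.

    Returns 0.0 when the sequences are empty or differ in length,
    since this sketch does not perform any alignment itself.
    """
    if not source_syllables or len(source_syllables) != len(target_syllables):
        return 0.0
    scores = [
        matrix.get(tgt, {}).get(src, 0.0)
        for src, tgt in zip(source_syllables, target_syllables)
    ]
    return sum(scores) / len(scores)

# Invented toy data: (source syllables, target syllables) per paired term.
TOY_PAIRS = [
    (["ka", "na", "da"], ["jia", "na", "da"]),
    (["ca", "na", "da"], ["jia", "na", "da"]),
    (["ka", "ta"], ["jia", "ta"]),
]

matrix = build_confusion_matrix(TOY_PAIRS)
score = similarity(matrix, ["ka", "na", "da"], ["jia", "na", "da"])
```

In this toy setting the row for the target syllable "jia" holds both "ka" and "ca", mirroring the abstract's point that a row collects every source syllable matched to one target syllable, whether the match is correct or erroneous. A candidate pair would then be accepted or rejected by comparing its score against a threshold, which this sketch leaves out.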