Paper: Customizing Parallel Corpora At The Document Level

ACL ID P04-3005
Title Customizing Parallel Corpora At The Document Level
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2004

Recent research in cross-lingual information retrieval (CLIR) has established the need to properly match the parallel corpus used for query translation to the target corpus. We propose a document-level approach to this problem: building a custom-made parallel corpus by automatically assembling it from documents taken from other parallel corpora. Although the general idea applies to any application that uses parallel corpora, we present results for CLIR in the medical domain. To extract the best-matched documents from several parallel corpora, we propose ranking individual documents by a length-normalized Okapi-based similarity score between each document and the target corpus. This ranking allows us to discard 50-90% of the training data, while avoiding the performanc...
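The document-ranking idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring formula, so everything below is an assumption made for illustration: the standard Okapi BM25 term weighting with the common defaults k1 = 1.2 and b = 0.75, whitespace tokenization, each candidate document treated as a query against the target corpus, and length normalization by dividing the summed score by the candidate's token count. The names `tokenize`, `bm25_score`, `rank_documents`, and `keep_top` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75):
    # Standard Okapi BM25 score of one target document for one query.
    # Each distinct query term is counted once.
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        if term not in tf:
            continue
        df = doc_freq[term]
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_factor = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_factor)
    return score


def rank_documents(candidates, target_corpus):
    # candidates: dict mapping a document name to its text
    # (one language side of a parallel document).
    # target_corpus: list of texts from the collection to be searched.
    target = [tokenize(doc) for doc in target_corpus]
    n_docs = len(target)
    avg_len = sum(len(doc) for doc in target) / n_docs
    doc_freq = Counter(term for doc in target for term in set(doc))

    ranked = []
    for name, text in candidates.items():
        tokens = tokenize(text)
        if not tokens:
            continue
        total = sum(bm25_score(tokens, doc, doc_freq, n_docs, avg_len)
                    for doc in target)
        # Assumed length normalization: divide by candidate length so
        # long documents are not favored merely for containing more terms.
        ranked.append((name, total / len(tokens)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def keep_top(ranked, fraction):
    # Keep the best-matched fraction of documents and discard the rest.
    keep = max(1, round(len(ranked) * fraction))
    return [name for name, _ in ranked[:keep]]
```

Under these assumptions, calling `keep_top(rank_documents(candidates, target_corpus), 0.5)` would retain the better-matched half of the candidate documents, which corresponds to the low end of the 50-90% discard range mentioned in the abstract.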