Paper: Aligning Sentences In Parallel Corpora

ACL ID P91-1022
Title Aligning Sentences In Parallel Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 1991

In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentence, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the ...
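
To make the idea of aligning on token counts alone more concrete, the following is a minimal sketch in Python. It is not the statistical model of the paper: it assumes a simple dynamic program over sentence lengths, a cost based on the absolute log ratio of token counts, and hand-picked penalties for each grouping of sentences. The names BEAD_PENALTY, length_cost, and align are hypothetical, the penalty values are illustrative only, and anchor points are ignored.

```python
import math

# Hypothetical penalties per grouping (source sentences, target sentences).
# These values are illustrative assumptions, not estimates from the paper.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(n_src, n_tgt):
    """Penalize mismatched token counts via the absolute log length ratio."""
    return abs(math.log((n_src + 1) / (n_tgt + 1)))


def align(src_lens, tgt_lens):
    """Align two lists of sentence token counts.

    Returns a list of (source_indices, target_indices) groups whose total
    cost under the assumed penalties is minimal.
    """
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable state by each allowed grouping.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = penalty + length_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj])
                )
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Backward pass: recover the groups from the stored back-pointers.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups
```

For example, align([10, 5, 20], [11, 24]) pairs the first sentences one to one and groups the second and third source sentences with the second target sentence. No words are compared, so the work per pair of positions is constant, which is consistent with the abstract's point that a length-only computation scales to very large collections.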