Paper: How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives

ACL ID: W11-1211
Title: How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives
Venue: Building and Using Comparable Corpora
Session:
Year: 2011
Authors:

In this paper, we question the homogeneity of a large parallel corpus by measuring the similarity between its various sub-parts. We compare the results obtained with a general measure of lexical similarity based on χ² against those obtained by counting discourse connectives. We argue that discourse connectives provide a more sensitive measure, revealing differences that are not visible with the general measure. We also provide evidence that translated texts have specific characteristics that distinguish them from non-translated ones, due to a universal tendency for explicitation.
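The two kinds of measure contrasted in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes a common corpus-comparison setup: a χ² statistic computed over the most frequent words of two sub-corpora (lower means more similar), and a simple rate of connectives per 1,000 tokens. The function names, the `top_n` cutoff, and the restriction to single-token connectives are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the
    joint corpus. Lower values indicate more similar word distributions.
    (Assumed formulation; the abstract does not specify one.)"""
    if not tokens_a or not tokens_b:
        raise ValueError("both sub-corpora must contain at least one token")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    joint = counts_a + counts_b
    chi2 = 0.0
    for word, total in joint.most_common(top_n):
        # Expected counts if the word were spread in proportion to corpus size.
        exp_a = total * n_a / (n_a + n_b)
        exp_b = total * n_b / (n_a + n_b)
        chi2 += (counts_a[word] - exp_a) ** 2 / exp_a
        chi2 += (counts_b[word] - exp_b) ** 2 / exp_b
    return chi2


def connective_rate(tokens, connectives, per=1000):
    """Occurrences of single-token connectives per `per` tokens.
    Multiword connectives are not handled in this sketch."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in connectives)
    return per * hits / len(tokens)


if __name__ == "__main__":
    # Toy data, purely for illustration.
    original = "the cat sat on the mat but the dog left".split()
    translated = "the cat sat on the mat however the dog therefore left".split()
    connectives = {"but", "however", "therefore"}
    print(chi_square_similarity(original, translated))
    print(connective_rate(original, connectives))
    print(connective_rate(translated, connectives))
```

Under these assumptions, two sub-corpora can score as similar on the general χ² measure while still differing in their connective rates, which is the kind of contrast the abstract describes.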