Paper: A Fast Method for Parallel Document Identification

ACL ID N07-2008
Title A Fast Method for Parallel Document Identification
Venue Human Language Technologies
Session Short Paper
Year 2007
Authors

We present a fast method to identify homogeneous parallel documents. The method is based on collecting counts of identical low-frequency words between possibly parallel documents. The candi- date with the most shared low-frequency words is selected as the parallel document. The method achieved 99.96% accuracy when tested on the EUROPARL corpus of parliamentary proceedings, failing only in anomalous cases of truncated or oth- erwise distorted documents. While other work has shown similar performance on this type of dataset, our approach pre- sented here is faster and does not require training. Apart from proposing an effi- cient method for parallel document iden- tification in a restricted domain, this pa- per furnishes evidence that parliamentary proceedings may be inappropriate for test- ...