Paper: Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model

ACL ID D08-1053
Title Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2008
Authors

Parallel web pages are an important source of training data for statistical machine translation. In this paper, we present a new approach to sentence alignment on parallel web pages. Parallel web pages tend to have parallel structures, and this structural correspondence can be indicative information for identifying parallel sentences. In our approach, the web page is represented as a tree, and a stochastic tree alignment model is used to exploit the structural correspondence for sentence alignment. Experiments show that this method significantly enhances alignment accuracy and robustness for parallel web pages, which are much more diverse and noisy than standard parallel corpora such as “Hansard”. With improved sentence alignment performance, web mining systems are ab...