Paper: A DOM Tree Alignment Model For Mining Parallel Data From The Web

ACL ID P06-1062
Title A DOM Tree Alignment Model For Mining Parallel Data From The Web
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2006
Authors

This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. Then a DOM tree alignment model is proposed to identify the translationally equivalent texts and hyperlinks between two parallel DOM trees. By tracing the identified parallel hyperlinks, parallel web documents are recursively mined. Compared with previous mining schemes, the benchmarks show that this new mining scheme improves the mining coverage, reduces mining bandwidth, and enhances the quality of mined parallel sentences.
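The link-tracing loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the DOM tree alignment model is replaced by a placeholder that pairs hyperlinks by document order, and the names `extract_links`, `align_links`, `mine_parallel_pages`, the `fetch` callback, and the toy `SITE` dictionary are all hypothetical. The recursion is written as an equivalent breadth-first queue.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def align_links(src_html, tgt_html):
    # ASSUMPTION: stand-in for the DOM tree alignment model, which is not
    # reproduced here. Hyperlinks are simply paired by position.
    return list(zip(extract_links(src_html), extract_links(tgt_html)))


def mine_parallel_pages(seed_pair, fetch):
    """Trace aligned hyperlink pairs outward from a seed page pair."""
    seen = {seed_pair}
    queue = deque([seed_pair])
    mined = []
    while queue:
        src_url, tgt_url = queue.popleft()
        src_html, tgt_html = fetch(src_url), fetch(tgt_url)
        if src_html is None or tgt_html is None:
            continue  # one side is missing, so the pair is not parallel
        mined.append((src_url, tgt_url))
        for pair in align_links(src_html, tgt_html):
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return mined


# Hypothetical in-memory "web site" so the sketch runs without network access.
SITE = {
    "/src/index": '<p>Home</p><a href="/src/about">A</a><a href="/src/news">N</a>',
    "/tgt/index": '<p>Home</p><a href="/tgt/about">A</a><a href="/tgt/news">N</a>',
    "/src/about": '<p>About</p><a href="/src/index">Home</a>',
    "/tgt/about": '<p>About</p><a href="/tgt/index">Home</a>',
    "/src/news": "<p>News</p>",
    "/tgt/news": "<p>News</p>",
}

pairs = mine_parallel_pages(("/src/index", "/tgt/index"), SITE.get)
for src, tgt in pairs:
    print(src, "<->", tgt)
```

Only pages reachable through aligned link pairs are fetched, which is one plausible reading of how tracing parallel hyperlinks could reduce mining bandwidth; a real system would swap `align_links` for an actual alignment model and `fetch` for an HTTP client.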