Paper: Mining Bilingual Data from the Web with Adaptively Learnt Patterns

ACL ID P09-1098
Title Mining Bilingual Data from the Web with Adaptively Learnt Patterns
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 2009
Authors

Mining bilingual data (including bilingual sen- tences and terms1) from the Web can benefit many NLP applications, such as machine translation and cross language information re- trieval. In this paper, based on the observation that bilingual data in many web pages appear collectively following similar patterns, an adaptive pattern-based bilingual data mining method is proposed. Specifically, given a web page, the method contains four steps: 1) pre- processing: parse the web page into a DOM tree and segment the inner text of each node into snippets; 2) seed mining: identify poten- tial translation pairs (seeds) using a word based alignment model which takes both trans- lation and transliteration into consideration; 3) pattern learning: learn generalized patterns with the iden...