Paper: An Iterative Link-based Method for Parallel Web Page Mining

ACL ID D14-1129
Title An Iterative Link-based Method for Parallel Web Page Mining
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014
Authors

Abstracts Identifying parallel web pages from bilingual web sites is a crucial step of bilingual resource construction for cross-lingual information processing. In this paper, we propose a link-based approach to distinguish parallel web pages from bilingual web sites. Compared with the existing methods, which only employ the internal translation similarity (such as content-based similarity and page structural similarity), we hypothesize that the external translation similarity is an effective feature to identify parallel web pages. Within a bilingual web site, web pages are interconnected by hyperlinks. The basic idea of our method is that the translation similarity of two pages can be inferred from their neighbor pages, which can be adopted as an important sourc...
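
The neighbor-based idea in the abstract can be illustrated with a small sketch. This is not the paper's formulation, which the truncated abstract does not give. It assumes a simple update in which each candidate pair's score is a blend of its internal similarity and the average best-match score of the pages linked from the two pages. The function name `propagate_similarity`, the blending weight `alpha`, and the iteration count are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source-language page, target-language page)


def propagate_similarity(
    internal: Dict[Pair, float],
    links_src: Dict[str, List[str]],
    links_tgt: Dict[str, List[str]],
    alpha: float = 0.5,      # hypothetical weight of the neighbor-derived score
    iterations: int = 10,    # hypothetical fixed number of update rounds
) -> Dict[Pair, float]:
    """Iteratively blend internal similarity with a neighbor-derived score.

    Illustrative assumption only: the update rule here is a generic
    propagation scheme, not the one defined in the paper.
    """
    scores = dict(internal)
    for _ in range(iterations):
        updated: Dict[Pair, float] = {}
        for (src, tgt), base in internal.items():
            neighbors_src = links_src.get(src, [])
            neighbors_tgt = links_tgt.get(tgt, [])
            if not neighbors_src or not neighbors_tgt:
                # No link evidence on one side: keep the internal score.
                updated[(src, tgt)] = base
                continue
            # For each page linked from src, take its best current score
            # against any page linked from tgt, then average.
            external = sum(
                max(scores.get((ns, nt), 0.0) for nt in neighbors_tgt)
                for ns in neighbors_src
            ) / len(neighbors_src)
            updated[(src, tgt)] = (1 - alpha) * base + alpha * external
        scores = updated
    return scores
```

In this sketch, a pair with weak internal similarity gains score when the pages it links to are themselves strong translation candidates, while a pair whose neighbors do not match stays low.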