Paper: Reformatting Web Documents Via Header Trees

ACL ID P05-3031
Title Reformatting Web Documents Via Header Trees
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2005
Authors

We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods.
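
To make the notion of a header tree concrete, the sketch below builds one from explicit HTML heading tags (h1 to h6) using only the Python standard library. This is an illustration of the target data structure, not the algorithm described in the abstract: the paper's method relies on the EM algorithm and clustering, whereas this sketch assumes the page already marks its headers with heading tags. The names `Node`, `HeadingCollector`, `build_header_tree`, and `render` are hypothetical and chosen for this example only.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """One header in the tree; level 0 is reserved for the artificial root."""
    text: str
    level: int
    children: list = field(default_factory=list)


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for <h1>..<h6> elements in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buf).strip()))
            self._level = None


def build_header_tree(html):
    """Nest each heading under the nearest preceding heading of a smaller level."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently added heading
    for level, text in collector.headings:
        # Pop until the top of the stack is a valid parent (strictly smaller level).
        while stack[-1].level >= level:
            stack.pop()
        node = Node(text, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node, depth=0):
    """Return the tree as a list of lines, indented two spaces per depth."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(render(child, depth + 1))
    return lines
```

Calling `render(build_header_tree(page))` on a page yields an indented outline that a reformatting step could then lay out for a different display. Real web pages often express headers through visual cues such as font size or boldness rather than heading tags, which is the case a learned approach like the one in the abstract is meant to handle.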