Paper: Zero-shot Entity Extraction from Web Pages

ACL ID P14-1037
Title Zero-shot Entity Extraction from Web Pages
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014
Authors

In order to extract entities of a fine-grained category from semi-structured data in web pages, existing information extraction systems rely on seed examples or redundancy across multiple web pages. In this paper, we consider a new zero-shot learning task of extracting entities specified by a natural language query (in place of seeds) given only a single web page. Our approach defines a log-linear model over latent extraction predicates, which select lists of entities from the web page. The main challenge is to define features on widely varying candidate entity lists. We tackle this by abstracting list elements and using aggregate statistics to define features. Finally, we created a new dataset of diverse queries and web pages, and show that our system achieves significantly ...
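The idea of scoring candidate entity lists with a log-linear model over aggregate features can be illustrated with a minimal sketch. This is not the paper's implementation: the abstraction function (a word-count bucket), the feature names, the weights, and the example lists are all hypothetical choices made for illustration, and each candidate list stands in for the output of one extraction predicate.

```python
import math
from collections import Counter


def abstract_element(text):
    """Map a list element to a coarse abstraction.

    Hypothetical choice: bucket by number of whitespace-separated words.
    """
    n = len(text.split())
    if n == 1:
        return "1-word"
    if n <= 3:
        return "2-3-words"
    return "long"


def aggregate_features(entities):
    """Aggregate statistics over the abstracted elements of one non-empty list."""
    counts = Counter(abstract_element(e) for e in entities)
    total = len(entities)
    feats = {f"frac[{k}]": v / total for k, v in counts.items()}
    # Share of the most common abstraction: 1.0 when the list is homogeneous.
    feats["majority_share"] = max(counts.values()) / total
    return feats


def log_linear_distribution(candidates, weights):
    """Return p(candidate) proportional to exp(weights . features(candidate))."""
    scores = [
        sum(weights.get(k, 0.0) * v for k, v in aggregate_features(c).items())
        for c in candidates
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Two hypothetical candidate lists selected from one page.
candidates = [
    ["Home", "About us and our long history page", "Contact"],
    ["Mount Whitney", "Mount Shasta", "Mount Williamson"],
]
weights = {"majority_share": 2.0}  # illustrative, not learned
probs = log_linear_distribution(candidates, weights)
```

Because the features are computed from statistics over abstracted elements rather than from the elements themselves, the same weight vector applies to lists of any length and content, which is what makes features comparable across widely varying candidates. In this sketch the homogeneous second list receives the higher probability.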