Paper: Mining The Web For Bilingual Text

ACL ID P99-1068
Title Mining The Web For Bilingual Text
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 1999
Authors

STRAND (Resnik, 1998) is a language-independent system for automatic discovery of text in parallel translation on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end-product is an automatically acquired parallel corpus comprising 2491 English-French document pairs, approximately 1.5 million words per language.