Paper: Mining for Domain-specific Parallel Text from Wikipedia

ACL ID W13-2514
Title Mining for Domain-specific Parallel Text from Wikipedia
Venue Building and Using Comparable Corpora
Year 2013

Previous attempts at extracting parallel data from Wikipedia were restricted by the monotonicity constraint of the alignment algorithm used for matching possible candidates. This paper proposes a method for exploiting Wikipedia articles without worrying about the position of the sentences in the text. The algorithm ranks the candidate sentence pairs by means of a customized metric, which combines different similarity criteria. Moreover, we limit the search space to a specific topical domain, since our final goal is to use the extracted data in a domain-specific Statistical Machine Translation (SMT) setting. The precision estimates show that the extracted sentence pairs are clearly semantically equivalent. The SMT experiments, however, show that the extracted data is not refined enoug...