Paper: ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

ACL ID P12-3016
Title ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2012
Authors

The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles to further advances in automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multilingual text resources), which are much more widely available than parallel translation data. The toolkit presented here extracts parallel content from comparable corpora. It consists of tools bundled into two workflows: (1) alignment of comparable documents and extraction of parallel sentences, and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable...