Paper: The MILE Corpus For Less Commonly Taught Languages

ACL ID N06-2002
Title The MILE Corpus For Less Commonly Taught Languages
Venue Human Language Technologies
Session Short Paper
Year 2006

This paper describes a small, structured English corpus that is designed for translation into Less Commonly Taught Languages (LCTLs), and a set of re-usable tools for creating similar corpora. The corpus systematically explores meanings that are known to affect morphology or syntax in the world's languages. Each sentence is associated with a feature structure showing the elements of meaning that are represented in the sentence. The corpus is highly structured so that it can support machine learning with only a small amount of data. As part of the REFLEX program, the corpus will be translated into multiple LCTLs, resulting in parallel corpora that can be used for training MT and other language technologies. Only the untranslated English corpus is described in this paper.
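To make the pairing of sentences with feature structures concrete, the following is a minimal sketch and not the paper's actual format or tooling. The class name `CorpusEntry`, the helper `minimal_pairs`, the feature names (`number`, `tense`), and the example sentences are all hypothetical, chosen only for illustration. The sketch assumes a feature structure can be approximated as a flat mapping from feature names to values, and shows one way a structured corpus might expose sentences that differ in exactly one element of meaning.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusEntry:
    """One English sentence plus a (hypothetical) flat feature structure."""
    sentence: str
    features: dict = field(default_factory=dict)


def minimal_pairs(entries):
    """Return (sentence_a, sentence_b, feature) triples for every pair of
    entries whose feature structures differ in exactly one feature."""
    pairs = []
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            keys = set(a.features) | set(b.features)
            diff = [k for k in keys if a.features.get(k) != b.features.get(k)]
            if len(diff) == 1:
                pairs.append((a.sentence, b.sentence, diff[0]))
    return pairs


# Illustrative entries only; these are not taken from the MILE corpus.
entries = [
    CorpusEntry("The dog barked.", {"number": "singular", "tense": "past"}),
    CorpusEntry("The dogs barked.", {"number": "plural", "tense": "past"}),
    CorpusEntry("The dogs bark.", {"number": "plural", "tense": "present"}),
]

for a, b, feature in minimal_pairs(entries):
    print(f"{feature}: {a!r} vs. {b!r}")
```

Pairs of this kind are one plausible reason a small but systematically varied corpus could be informative for a learner: when the translations of two such sentences are compared, any difference can be attributed to the single feature that changed.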