Paper: The Human Language Project: Building a Universal Corpus of the World’s Languages

ACL ID P10-1010
Title The Human Language Project: Building a Universal Corpus of the World’s Languages
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2010
Authors

We present a grand challenge to build a corpus that will include all of the world’s languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that the ability to train systems to translate into and out of a given language be the yardstick for determining when we have successfully captured a language. We call on the computational linguistics community to begin work on this Universal Corpus, pursuing the many strands of activity described here, as their contribution to the global effort to document the world’s linguistic heritage before more languages fall silent.