Paper: Towards a Data Model for the Universal Corpus

ACL ID W11-1216
Title Towards a Data Model for the Universal Corpus
Venue Building and Using Comparable Corpora
Year 2011

We describe the design of a comparable corpus that spans all of the world’s languages and facilitates large-scale cross-linguistic processing. This Universal Corpus consists of text collections aligned at the document and sentence level, multilingual wordlists, and a small set of morphological, lexical, and syntactic annotations. The design encompasses submission, storage, and access. Submission preserves the integrity of the work, allows asynchronous updates, and facilitates scholarly citation. Storage employs a cloud-hosted filestore containing normalized source data together with a database of texts and annotations. Access is permitted to the filestore, the database, and an application programming interface. All aspects of the Universal Corpus are open, and we invite...
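
As an illustration of what sentence-level alignment across languages could look like in a database of texts, the following is a minimal sketch in Python. It is not the paper's schema: the class names `Sentence` and `AlignedCorpus`, the field names, the document identifiers, and the sample sentences are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sentence:
    """One sentence of one document (hypothetical record layout)."""
    doc_id: str   # assumed document identifier
    index: int    # position of the sentence within its document
    lang: str     # language code, e.g. an ISO 639-3 code
    text: str


@dataclass
class AlignedCorpus:
    """Toy in-memory store: sentences plus unordered alignment links."""
    sentences: dict = field(default_factory=dict)  # (doc_id, index) -> Sentence
    links: set = field(default_factory=set)        # frozenset of two sentence keys

    def add(self, sentence):
        self.sentences[(sentence.doc_id, sentence.index)] = sentence

    def align(self, key_a, key_b):
        # Both sentences must already be stored before they can be linked.
        if key_a not in self.sentences or key_b not in self.sentences:
            raise KeyError("both sentences must be added before aligning")
        self.links.add(frozenset((key_a, key_b)))

    def translations(self, key, lang):
        """Return sentences in `lang` that are aligned with `key`."""
        found = []
        for link in self.links:
            if key in link:
                for other in link - {key}:
                    candidate = self.sentences[other]
                    if candidate.lang == lang:
                        found.append(candidate)
        return sorted(found, key=lambda s: (s.doc_id, s.index))


# Made-up sample data, for illustration only.
corpus = AlignedCorpus()
corpus.add(Sentence("sample-eng", 0, "eng", "Hello."))
corpus.add(Sentence("sample-deu", 0, "deu", "Hallo."))
corpus.align(("sample-eng", 0), ("sample-deu", 0))
german = corpus.translations(("sample-eng", 0), "deu")
```

Storing links as unordered pairs keeps alignment symmetric, so a lookup works from either language; a real deployment would presumably keep these records in a database table rather than in memory.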