Paper: SeedLing: Building and Using a Seed corpus for the Human Language Project

ACL ID W14-2211
Title SeedLing: Building and Using a Seed corpus for the Human Language Project
Venue Workshop on the Use of Computational Methods in the Study of Endangered Languages
Session
Year 2014
Authors

A broad-coverage corpus such as the Human Language Project envisioned by Abney and Bird (2010) would be a powerful resource for the study of endangered languages. Existing corpora are limited in the range of languages covered, in standardisation, or in machine-readability. In this paper we present SeedLing, a seed corpus for the Human Language Project. We first survey existing efforts to compile cross-linguistic resources, then describe our own approach. To build the foundation text for a Universal Corpus, we crawl and clean texts from several web sources that contain data from a large number of languages, and convert them into a standardised form consistent with the guidelines of Abney and Bird (2011). The resulting corpus is more easily accessible and machine-readable than ...