Paper: Using Collections of Human Language Intuitions to Measure Corpus Representativeness

ACL ID C14-1200
Title Using Collections of Human Language Intuitions to Measure Corpus Representativeness
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2014
Authors

In corpus linguistics there have been numerous attempts to compile balanced corpora, resulting in text collections such as the Brown Corpus or the British National Corpus. These corpora are meant to reflect the average language use a native speaker typically encounters. But is it possible to measure to what extent these efforts were successful? Assuming that humans' language intuitions are based on our brain's capability to statistically analyze perceived language and to memorize these statistics, we suggest a method for measuring corpus representativeness which compares corpus statistics to three types of human language intuitions as collected from test persons: word familiarity, word association, and word relatedness. We compute a representativeness score for a corpus by extracti...
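
To make the idea of comparing corpus statistics with human intuitions concrete, the following is a minimal sketch for the word familiarity case only. It rests on assumptions the abstract does not confirm: that familiarity ratings are compared with raw corpus frequencies, and that the comparison is a Spearman rank correlation. The function names (`rank`, `pearson`, `familiarity_score`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def familiarity_score(tokens, familiarity):
    """Spearman correlation between corpus frequency and human familiarity.

    tokens: list of word tokens from the corpus (assumed pre-tokenized).
    familiarity: dict mapping word -> human familiarity rating (assumed format).
    """
    counts = Counter(tokens)
    words = sorted(familiarity)
    freqs = [counts[w] for w in words]  # words absent from the corpus count as 0
    ratings = [familiarity[w] for w in words]
    return pearson(rank(freqs), rank(ratings))


# Toy data, invented for illustration only.
corpus_tokens = "the the the cat cat dog".split()
ratings = {"the": 7.0, "cat": 5.0, "dog": 4.0}
score = familiarity_score(corpus_tokens, ratings)
```

Under these assumptions, a score near 1 would mean the corpus ranks words by frequency in the same order that test persons rank them by familiarity, and a score near -1 would mean the reverse order. Word association and word relatedness would each need their own corpus statistic and are not covered by this sketch.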