Paper: An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)

ACL ID P07-2011
Title An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2007
Authors

Gorman and Curran (2006) argue that thesaurus generation for billion+-word corpora is problematic as the full computation takes many days. We present an algorithm with which the computation takes under two hours. We have created, and made publicly available, thesauruses based on large corpora for (at time of writing) seven major world languages. The development is implemented in the Sketch Engine (Kilgarriff et al., 2004). Another innovative development in the same tool is the presentation of the grammatical behaviour of a word against the background of how all other words of the same word class behave. Thus, the English noun constraint occurs in the plural 75% of the time. Is this a salient lexical fact? To form a judgement, we need to know the distribution for all nouns. We use histograms...
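The idea of judging one noun's plural rate against the distribution for all nouns can be illustrated with a minimal sketch. This is not the Sketch Engine implementation: the function names (`plural_share`, `histogram`, `fraction_below`) and the counts are invented for illustration, and only the 75% figure for "constraint" comes from the text above.

```python
def plural_share(counts):
    """Map each noun to the fraction of its occurrences that are plural."""
    return {
        noun: plural / total
        for noun, (plural, total) in counts.items()
        if total > 0
    }


def histogram(shares, bins=10):
    """Count nouns per equal-width bin of plural share over [0, 1]."""
    hist = [0] * bins
    for share in shares.values():
        # A share of exactly 1.0 is placed in the last bin.
        index = min(int(share * bins), bins - 1)
        hist[index] += 1
    return hist


def fraction_below(shares, noun):
    """Fraction of the other nouns whose plural share is lower than this noun's."""
    target = shares[noun]
    others = [s for n, s in shares.items() if n != noun]
    return sum(s < target for s in others) / len(others)


# Hypothetical toy counts: noun -> (plural occurrences, total occurrences).
# Only the 75% plural rate for "constraint" is taken from the abstract.
toy_counts = {
    "constraint": (75, 100),
    "water": (2, 100),
    "idea": (40, 100),
    "house": (30, 100),
    "child": (55, 100),
}

shares = plural_share(toy_counts)
hist = histogram(shares, bins=4)
rank = fraction_below(shares, "constraint")
```

With real corpus counts, a noun whose plural share sits in a sparsely populated tail of the histogram would be a candidate for a salient lexical fact, whereas a share near the bulk of the distribution would not.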