Paper: Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus

ACL ID C98-1068
Title Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1998
Authors
  • Susanne Gahl (University of California at Berkeley, Berkeley CA)

This paper presents a method for extracting subcorpora documenting different subcategorization frames for verbs, nouns, and adjectives in the 100-million-word British National Corpus. The extraction tool consists of a set of batch files for use with the Corpus Query Processor (CQP), which is part of the IMS corpus workbench (cf. Christ 1994a,b). A macroprocessor has been developed that allows the user to specify in a simple input file which subcorpora are to be created for a given lemma. The resulting subcorpora can be used (1) to provide evidence for the subcategorization properties of a given lemma, and to facilitate the selection of corpus lines for lexicographic research, and (2) to determine the frequencies of different syntactic contexts of each lemma.
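The abstract does not show the input file format or the query patterns, so the following is only a minimal sketch of how such a macro expansion might look, not the paper's tool. The spec format (`<lemma> <pos-regex> <frame> ...`), the frame names, the templates in `FRAME_TEMPLATES`, the attribute names `lemma` and `pos`, and the capitalised subcorpus names are all assumptions made for illustration. The part-of-speech patterns are loosely modelled on the BNC C5 tagset. The last function illustrates use (2): turning per-frame hit counts into relative frequencies.

```python
# Hypothetical frame templates in CQP-style query syntax.
# These are illustrative only and are NOT the patterns used in the paper.
FRAME_TEMPLATES = {
    "np": '[lemma="{lemma}" & pos="{pos}"] [pos="AT0|DPS"]? [pos="AJ.*"]* [pos="NN.*"]',
    "that_clause": '[lemma="{lemma}" & pos="{pos}"] [word="that"]',
    "to_inf": '[lemma="{lemma}" & pos="{pos}"] [word="to"] [pos="V.I"]',
}


def parse_spec(text):
    """Parse lines of the assumed form: <lemma> <pos-regex> <frame> [<frame> ...]."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        lemma, pos, *frames = line.split()
        entries.append((lemma, pos, frames))
    return entries


def expand(entries):
    """Expand each (lemma, pos, frames) entry into one named query per frame."""
    queries = []
    for lemma, pos, frames in entries:
        for frame in frames:
            pattern = FRAME_TEMPLATES[frame].format(lemma=lemma, pos=pos)
            name = f"{lemma.capitalize()}_{frame}"  # assumed naming scheme
            queries.append(f"{name} = {pattern};")
    return queries


def relative_frequencies(counts):
    """Convert per-frame hit counts into shares of all hits for the lemma."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {frame: n / total for frame, n in counts.items()}


if __name__ == "__main__":
    spec = """
    # lemma  pos    frames
    give     VV.*   np that_clause
    """
    for query in expand(parse_spec(spec)):
        print(query)
```

Keeping the frame patterns in a single table means a new lemma only needs one line in the spec, which is the kind of convenience the macroprocessor is described as providing. A real set of patterns would need to handle intervening material such as adverbs, passives, and extracted arguments.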