Paper: Text Genre Detection Using Common Word Frequencies

ACL ID C00-2117
Title Text Genre Detection Using Common Word Frequencies
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2000
Authors

In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written language. Using as testing ground a part of the Wall Street Journal corpus, we show that the most frequent words of the British National Corpus, representing the most frequent words of the written English language, are more reliable discriminators of text genre in comparison to the most frequent words of the training corpus. Moreover, the fi'equencies of occurrence of the most common punctuation marks...