Paper: A Stochastic Process For Word Frequency Distributions

ACL ID P91-1035
Title A Stochastic Process For Word Frequency Distributions
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 1991
  • R. Harald Baayen (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)

A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora.

FREQUENCY DISTRIBUTIONS

Various models for word frequency distributions have been developed since Zipf (1935) applied the zeta distribution to describe a wide range of lexical data. Mandelbrot (1953, 1962) extended Zipf's distribution 'law'

$$f_i = \frac{K}{i^{\gamma}}, \tag{1}$$

where $f_i$ is the sample frequency of the $i$-th type in a ranking according to decreasing frequency, with the parameter $B$,

$$f_i = \frac{K}{B + i^{\gamma}}, \tag{2}$$

by means of which fits are obtained that are more accurate with respect to the higher frequency words. Simon (1955...
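
As a rough illustration of the two rank-frequency forms referred to as (1) and (2), the following Python sketch evaluates both for the first few ranks. It assumes the equations read f_i = K / i^gamma (Zipf) and f_i = K / (B + i^gamma) (Mandelbrot's extension); the exponent name `gamma` and the placement of the exponent follow the text as extracted here, whereas the commonly quoted Zipf-Mandelbrot form raises the whole sum (B + i) to the exponent. The function names and the parameter values are hypothetical and are not fitted to any corpus.

```python
def zipf_frequency(rank: int, K: float, gamma: float) -> float:
    """Frequency of the type at the given rank under form (1): K / rank**gamma."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return K / rank ** gamma


def mandelbrot_frequency(rank: int, K: float, gamma: float, B: float) -> float:
    """Frequency under form (2) as read here: K / (B + rank**gamma).

    Assumption: the exponent applies to the rank only, as in the extracted text.
    With B = 0 this reduces to form (1).
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return K / (B + rank ** gamma)


if __name__ == "__main__":
    # Illustrative parameter values only (hypothetical, not estimated from data).
    K, gamma, B = 1000.0, 1.0, 2.0
    for rank in range(1, 6):
        print(
            rank,
            round(zipf_frequency(rank, K, gamma), 1),
            round(mandelbrot_frequency(rank, K, gamma, B), 1),
        )
```

With a positive `B`, the predicted frequencies at the lowest ranks are pulled down far more than those at higher ranks, which is the part of the curve where the text says the extended form fits better.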