Paper: Zipf's Law and Statistical Data on Modern Tibetan

ACL ID C14-1032
Title Zipf's Law and Statistical Data on Modern Tibetan
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2014

In this paper, a large-scale modern Tibetan text corpus is built, comprising about 190 thousand documents, 67.21 million words, and 93.66 million syllables in total. Based on the corpus, statistics are computed for several language units at different granularities. The statistical data show that a syllable has 3.26 letters or 2.20 super characters on average, while a sentence has 75.40 letters or 63.14 super characters. The top 10 super characters, syllables, and words account for 66.3156%, 16.5556%, and 24.6415% of the corpus, respectively. Curves for the n-gram frequency-rank lists of super characters, syllables, and words are plotted. They show that when all the n-gram phrases for n = 1, 2, ..., 5 are put together and sorted by frequency in descending order, the frequency-rank curves in log-log axes can be...
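The pooled frequency-rank list described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`ngrams`, `pooled_rank_frequency`, `log_log_points`) and the toy input are hypothetical, and the sketch assumes the text has already been segmented into units (super characters, syllables, or words), one token list per sentence.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pooled_rank_frequency(sentences, max_n=5):
    """Pool n-gram counts for n = 1..max_n and rank them by frequency.

    `sentences` is an iterable of token lists (hypothetical, pre-segmented
    input). Returns a list of (rank, ngram, frequency) tuples, with rank 1
    assigned to the most frequent n-gram.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    # most_common() orders entries by count, highest first.
    return [
        (rank, gram, freq)
        for rank, (gram, freq) in enumerate(counts.most_common(), start=1)
    ]


def log_log_points(table):
    """Map a rank-frequency table to (log10 rank, log10 frequency) points."""
    return [(math.log10(rank), math.log10(freq)) for rank, _, freq in table]


if __name__ == "__main__":
    # Toy placeholder tokens, not real corpus data.
    toy = [["a", "b", "a", "b"], ["a", "b", "c"]]
    for rank, gram, freq in pooled_rank_frequency(toy, max_n=2):
        print(rank, " ".join(gram), freq)
```

The points returned by `log_log_points` are what one would plot to obtain a frequency-rank curve in log-log axes; restricting `max_n` to 1 gives the ordinary single-unit frequency-rank list.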