Paper: A New Method Of N-Gram Statistics For Large Number Of N And Automatic Extraction Of Words And Phrases From Large Text Data Of Japanese

ACL ID C94-1101
Title A New Method Of N-Gram Statistics For Large Number Of N And Automatic Extraction Of Words And Phrases From Large Text Data Of Japanese
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1994
Authors

In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model to characterize a natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but this statistical analysis of large text data and for a large n has never been carried out because of the memory limitations of computers and the shortage of text data. Taking advantage of recent powerful computers, we developed a new algorithm for n-grams of large text data for arbitrarily large n and successfully calculated, within a relatively short time, n-grams of some Japanese text data containing between two and thirty million characters. From this experiment it became clear...
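The quantity the abstract refers to, the frequency of every string of n characters, can be illustrated with a minimal sketch. This is not the paper's algorithm, which the excerpt does not describe. It is a plain dictionary-based count, and the function name `char_ngram_counts` and the sample string are assumptions made up for illustration. It works on characters rather than space-delimited words, so it applies equally to Japanese text.

```python
from collections import Counter


def char_ngram_counts(text: str, n: int) -> Counter:
    """Count every substring of exactly n characters in text.

    Hypothetical helper for illustration only: it stores one table
    entry per distinct n-gram, so memory grows quickly with n and
    with the size of the text.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the text, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    sample = "abababc"  # made-up sample text
    bigrams = char_ngram_counts(sample, 2)
    print(bigrams.most_common(2))  # [('ab', 3), ('ba', 2)]
```

Because the table keeps a separate entry for each distinct n-gram, this naive approach becomes impractical for large n on texts of millions of characters, which is the memory limitation the abstract mentions.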