Paper: A Statistical Method For Extracting Uninterrupted And Interrupted Collocations From Very Large Corpora

ACL ID C96-1097
Title A Statistical Method For Extracting Uninterrupted And Interrupted Collocations From Very Large Corpora
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1996
Authors

To extract rigid expressions with a high frequency of use, a new algorithm is proposed that can efficiently extract both uninterrupted and interrupted collocations from very large corpora. The statistical method recently proposed for calculating N-grams of arbitrary N can be applied to the extraction of uninterrupted collocations. However, this method extracts such a large volume of fragmentary and unnecessary expressions that it is impossible to extract interrupted collocations by combining its results. To solve this problem, this paper proposes a new algorithm that suppresses the extraction of unnecessary substrings, followed by a method that makes it possible to extract interrupted collocations. The new methods are applied to Japanese newspaper arti...
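
The idea of suppressing unnecessary substrings can be illustrated with a small sketch. This is not the paper's algorithm, whose details are not given in the abstract; it is a naive illustration under stated assumptions: N-grams are counted up to a fixed `max_n`, and a shorter N-gram is treated as unnecessary when a frequent N-gram one token longer contains it with exactly the same frequency. The function names `count_ngrams` and `maximal_ngrams` and the parameters `max_n` and `min_freq` are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, max_n):
    """Count every contiguous N-gram of length 1..max_n (naive, in-memory)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def maximal_ngrams(counts, min_freq=2):
    """Keep frequent N-grams, dropping those subsumed by a longer one.

    Assumption for illustration: an N-gram is "unnecessary" if a frequent
    N-gram one token longer contains it as a prefix or suffix and occurs
    exactly as often, so the shorter string never appears on its own.
    """
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    redundant = set()
    for gram, c in frequent.items():
        if len(gram) < 2:
            continue
        # The two immediate substrings: drop the last or the first token.
        for sub in (gram[:-1], gram[1:]):
            if frequent.get(sub) == c:
                redundant.add(sub)
    return {g: c for g, c in frequent.items() if g not in redundant}


tokens = "a b c x a b c y a b c".split()
result = maximal_ngrams(count_ngrams(tokens, max_n=3), min_freq=2)
print(result)
```

In this toy input, every occurrence of `a`, `b`, `c`, `a b`, and `b c` falls inside an occurrence of `a b c`, so only the longest expression survives. A practical implementation for very large corpora would need a far more memory-efficient counting scheme than this in-memory `Counter`, and extracting interrupted collocations would require a further step that combines the surviving strings, which this sketch does not attempt.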