Paper: Automatic Corpus-Based Thai Word Extraction With The C4.5 Learning Algorithm

ACL ID C00-2116
Title Automatic Corpus-Based Thai Word Extraction With The C4.5 Learning Algorithm
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2000
Authors

"Word" is difficult to define in the languages that do not exhibit explicit word boundary, such as Thai. Traditional methods on defining words for this kind of languages have to depend on human judgement which bases on unclear criteria o1" procedures, and have several limitations. This paper proposes an algorithm for word extraction from Thai texts without borrowing a hand from word segmentation. We employ the c4.5 learning algorithm for this task. Several attributes such as string length, frequency, nmtual information and entropy are chosen for word/non-word determination. Our experiment yields high precision results about 85% in both training and test corpus.