Paper: Non-linear Mapping for Improved Identification of 1300+ Languages

ACL ID D14-1069
Title Non-linear Mapping for Improved Identification of 1300+ Languages
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014
Authors

Non-linear mappings of the form $P(\mathrm{ngram})^{\gamma}$ and $\frac{\log(1+\tau P(\mathrm{ngram}))}{\log(1+\tau)}$ are applied to the n-gram probabilities in five trainable open-source language identifiers. The first mapping reduces classification errors by 4.0% to 83.9% over a test set of more than one million 65-character strings in 1366 languages, and by 2.6% to 76.7% over a subset of 781 languages. The second mapping improves four of the five identifiers by 10.6% to 83.8% on the larger corpus and 14.4% to 76.7% on the smaller corpus. The subset corpus and the modified programs are made freely available for download at http://www.cs.cmu.edu/~ralf/langid.html.
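
As a rough illustration of the two mappings named in the abstract, the sketch below applies a power mapping and a normalized logarithmic mapping to a single n-gram probability. The function names, the parameter names gamma and tau, and the sample values are assumptions made for this example; the abstract does not state the parameter values used or how each identifier combines the mapped scores.

```python
import math


def power_map(p: float, gamma: float) -> float:
    """Power mapping: raise the n-gram probability to the exponent gamma."""
    return p ** gamma


def log_map(p: float, tau: float) -> float:
    """Normalized log mapping: log(1 + tau * p) / log(1 + tau).

    Dividing by log(1 + tau) keeps the result in [0, 1] for p in [0, 1].
    """
    return math.log1p(tau * p) / math.log1p(tau)


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    gamma, tau = 0.5, 10.0
    for p in (0.0, 0.001, 0.25, 1.0):
        print(f"p={p:<6} power={power_map(p, gamma):.4f} log={log_map(p, tau):.4f}")
```

Both functions leave 0 and 1 fixed. For gamma < 1 or tau > 0, they raise probabilities that lie strictly between 0 and 1, which reduces the gap between rare and frequent n-grams.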