Paper: Language Identification: The Long and the Short of the Matter

ACL ID N10-1027
Title Language Identification: The Long and the Short of the Matter
Venue Human Language Technologies
Session Main Conference
Year 2010
Authors

Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of what models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents. We also show that it is possible to perform language identification without having to perform explicit character encoding detection.