Paper: langid.py: An Off-the-shelf Language Identification Tool

ACL ID P12-3005
Title langid.py: An Off-the-shelf Language Identification Tool
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2012
Authors

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users who require language identification without wanting to invest in preparation of in-domain training data.