Paper: NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus

ACL ID: C14-2019
Venue: International Conference on Computational Linguistics
Session: Main Conference
Year: 2014

The NTU-MC Toolkit is a compilation of tools for annotating the Nanyang Technological University - Multilingual Corpus (NTU-MC). The NTU-MC is a parallel corpus of linguistically diverse languages (Arabic, English, Indonesian, Japanese, Korean, Mandarin Chinese, Thai and Vietnamese). The NTU-MC thrives on the mantra of "more data is better data and more annotation is better information". Beyond increasing the amount of parallel data from diverse language pairs, annotating the corpus with various layers of information allows corpus linguists to discover linguistic phenomena and provides computational linguists with pre-annotated features for various NLP tasks. In addition to agglomerating existing tools into a single Python wrapper library, we have implemented three tools (Mini-segmenter, GaCh...