Paper: Assembling the Kazakh Language Corpus

ACL ID D13-1104
Title Assembling the Kazakh Language Corpus
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2013

This paper presents the Kazakh Language Corpus (KLC), which is one of the first attempts made within a local research community to assemble a Kazakh corpus. KLC is designed to be a large scale corpus containing over 135 million words and conveying five stylistic genres: literary, publicistic, official, scientific and informal. Along with its primary part KLC comprises such parts as: (i) annotated sub-corpus, containing segmented documents encoded in the eXtensible Markup Language (XML) that marks complete morphological, syntactic, and structural characteristics of texts; (ii) as well as a sub-corpus with the annotated speech data. KLC has a web-based corpus management system that helps to navigate the data and retrieve necessary information. KLC is also open for contributors, who...