Paper: Syntax Annotation for the GENIA Corpus

ACL ID I05-2038
Title Syntax Annotation for the GENIA Corpus
Venue International Joint Conference on Natural Language Processing
Session poster-demo-tutorial
Year 2005
Authors
  • Yuka Tateisi (CREST Japan Science and Technology Corporation, Saitama Japan)
  • Akane Yakushiji (University of Tokyo, Tokyo Japan)
  • Tomoko Ohta (University of Tokyo, Tokyo Japan; University of Manchester, Manchester UK; CREST Japan Science and Technology Corporation, Saitama Japan)
  • Jun'ichi Tsujii

A linguistically annotated corpus based on texts in the biomedical domain has been constructed to tune natural language processing (NLP) tools for bio-textmining. As the focus of information extraction shifts from "nominal" information such as named entities to "verbal" information such as the functions and interactions of substances, the application of parsers has become one of the key technologies, and a corpus annotated for the syntactic structure of sentences is therefore in demand. A subset of the GENIA corpus consisting of 500 MEDLINE abstracts has been annotated for syntactic structure in an XML-based format based on the Penn Treebank II (PTB) scheme. An inter-annotator agreement test indicated that the writing style rather than the contents of the research abstracts is the source of the difficulty i...