Paper: Building a Coreference-Annotated Corpus from the Domain of Biochemistry

ACL ID W11-0210
Title Building a Coreference-Annotated Corpus from the Domain of Biochemistry
Venue Workshop on Biomedical Natural Language Processing
Session
Year 2011
Authors

One of the reasons for which the resolution of coreferences has remained a challenging information extraction task, especially in the biomedical domain, is the lack of training data in the form of annotated corpora. In order to address this issue, we developed the HANAPIN corpus. It consists of full-text articles from biochemistry literature, covering entities of several semantic types: chemical compounds, drug targets (e.g., proteins, enzymes, cell lines, pathogens), diseases, organisms and drug effects. All of the co-referring expressions pertaining to these semantic types were annotated based on the annotation scheme that we developed. We observed four general types of coreferences in the corpus: sortal, pronominal, abbreviation and numerical. Using the MASI distance metr...
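
For readers unfamiliar with the MASI distance named in the abstract, the following is a minimal sketch of the metric as it is commonly defined for set-valued annotations: one minus the Jaccard coefficient scaled by a monotonicity weight (1 for identical sets, 2/3 when one set contains the other, 1/3 for partial overlap, 0 for disjoint sets). The function name, the mention identifiers in the usage lines, and the treatment of two empty sets as distance 0 are assumptions made for illustration. This is not the authors' code, and the abstract as given here does not say how the metric was applied.

```python
def masi_distance(a, b):
    """MASI distance between two sets of labels (0 = identical, 1 = disjoint)."""
    a, b = frozenset(a), frozenset(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    inter = a & b
    jaccard = len(inter) / len(union)
    # Monotonicity weight: rewards identity and subset relations.
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2.0 / 3.0
    elif inter:
        m = 1.0 / 3.0
    else:
        m = 0.0
    return 1.0 - jaccard * m


# Hypothetical example: two annotators' coreference chains as sets of mention IDs.
chain_a = {"m1", "m2", "m3"}
chain_b = {"m1", "m2"}
print(masi_distance(chain_a, chain_b))
```

Treating each coreference chain as a set of mention identifiers is one plausible way to compare annotators with a set-based metric such as this one; other representations are possible.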