Paper: Different Structures for Evaluating Answers to Complex Questions: Pyramids Won't Topple and Neither Will Human Assessors

ACL ID P07-1097
Title Different Structures for Evaluating Answers to Complex Questions: Pyramids Won't Topple and Neither Will Human Assessors
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2007
Authors
  • Hoa Trang Dang (National Institute of Standards and Technology, Gaithersburg MD)
  • Jimmy Lin (University of Maryland, College Park MD)

The idea of “nugget pyramids” has recently been introduced as a refinement to the nugget-based methodology used to evaluate answers to complex questions in the TREC QA tracks. This paper examines data from the 2006 evaluation, the first large-scale deployment of the nugget pyramids scheme. We show that this method of combining judgments of nugget importance from multiple assessors increases the stability and discriminative power of the evaluation while introducing only a small additional burden in terms of manual assessment. We also consider an alternative method for combining assessor opinions, which yields a distinction similar to micro- and macro-averaging in the context of classification tasks. While the two approaches differ in terms of underlying assumptions, their result...
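To make the two ways of combining assessor judgments concrete, here is a minimal Python sketch. It assumes a simple 0/1 encoding of each assessor's vital/okay judgment per nugget; the function names, the normalization by the largest vital count, and the use of vital-nugget recall as the per-assessor score are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch (not the paper's exact formulation): two ways of
# combining per-assessor "vital" judgments into a single evaluation signal.

from typing import Dict, List

# judgments[nugget_id] is a list of 0/1 flags, one per assessor:
# 1 = the assessor marked the nugget as vital, 0 = merely okay.
Judgments = Dict[str, List[int]]

def pyramid_weights(judgments: Judgments) -> Dict[str, float]:
    """Pyramid-style (micro-like) combination: a nugget's weight is the
    number of assessors who called it vital, normalized by the largest
    such count; one weighted score is then computed over all nuggets."""
    counts = {n: sum(flags) for n, flags in judgments.items()}
    max_count = max(counts.values()) or 1
    return {n: c / max_count for n, c in counts.items()}

def per_assessor_scores(judgments: Judgments,
                        matched: List[str]) -> List[float]:
    """Macro-like alternative: score the answer separately against each
    assessor's vital/okay judgments, then average the per-assessor scores.
    Here the per-assessor score is simple vital-nugget recall."""
    num_assessors = len(next(iter(judgments.values())))
    scores = []
    for a in range(num_assessors):
        vital = {n for n, flags in judgments.items() if flags[a] == 1}
        if not vital:
            continue
        scores.append(len(vital & set(matched)) / len(vital))
    return scores

if __name__ == "__main__":
    judgments = {"N1": [1, 1, 1], "N2": [1, 0, 0], "N3": [0, 0, 1]}
    print(pyramid_weights(judgments))    # {'N1': 1.0, 'N2': 0.33..., 'N3': 0.33...}
    scores = per_assessor_scores(judgments, matched=["N1", "N3"])
    print(sum(scores) / len(scores))     # macro-style average over assessors
```

The contrast mirrors micro- versus macro-averaging in classification: the first approach pools all assessors' judgments into a single weighted nugget set before scoring, while the second scores against each assessor individually and averages afterwards.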