Paper: Evaluating The Evaluation: A Case Study Using The TREC 2002 Question Answering Track

ACL ID N03-1034
Title Evaluating The Evaluation: A Case Study Using The TREC 2002 Question Answering Track
Venue Human Language Technologies
Session Main Conference
Year 2003
Authors
  • Ellen M. Voorhees (National Institute of Standards and Technology, Gaithersburg, MD)

Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.

Metric-based evaluations of human language technology such as MUC, TREC, and DUC continue to pro...
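The abstract alludes to empirically estimating how large a score difference must be before two runs can be reliably called different. A common way to study this kind of stability is to re-score pairs of runs on random subsets of the questions and count how often their relative order flips. The sketch below illustrates that idea only; the function name, the resampling scheme, and the example data are illustrative assumptions, not the paper's exact procedure.

```python
import random

def swap_rate(scores_a, scores_b, trials=1000, subset_size=250, seed=0):
    """Estimate how often two runs change relative order when evaluated
    on random subsets of the question set.

    scores_a, scores_b: per-question scores (same length, same question
    order) for the two runs being compared.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    assert len(scores_b) == n and subset_size <= n

    # Ordering of the two runs on the full question set.
    full_diff = sum(scores_a) / n - sum(scores_b) / n

    swaps = 0
    for _ in range(trials):
        idx = rng.sample(range(n), subset_size)
        mean_a = sum(scores_a[i] for i in idx) / subset_size
        mean_b = sum(scores_b[i] for i in idx) / subset_size
        # A "swap" is a subset on which the runs' relative order flips.
        if (mean_a - mean_b) * full_diff < 0:
            swaps += 1
    return swaps / trials

# Example: two hypothetical runs scored on 500 questions (1 = correct, 0 = wrong).
if __name__ == "__main__":
    rng = random.Random(42)
    run_a = [1 if rng.random() < 0.45 else 0 for _ in range(500)]
    run_b = [1 if rng.random() < 0.40 else 0 for _ in range(500)]
    print(f"estimated swap rate: {swap_rate(run_a, run_b):.3f}")
```

Binning such swap rates by the size of the full-set score difference gives an empirical estimate of the gap needed to conclude, at a chosen error tolerance, that one run genuinely outperforms another.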