Paper: Will Pyramids Built Of Nuggets Topple Over?

ACL ID N06-1049
Title Will Pyramids Built Of Nuggets Topple Over?
Venue Human Language Technologies
Session Main Conference
Year 2006

The present methodology for evaluating complex questions at TREC analyzes answers in terms of facts called “nuggets”. The official F-score metric represents the harmonic mean of recall and precision at the nugget level. There is an implicit assumption that some facts are more important than others, which is implemented in a binary split between “vital” and “okay” nuggets. This distinction holds important implications for the TREC scoring model (essentially, systems only receive credit for retrieving vital nuggets) and is a source of evaluation instability. The upshot is that for many questions in the TREC testsets, the median score across all submitted runs is zero. In this work, we introduce a scoring model based on judgments from multiple assessors th...
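
To make the consequence of the vital/okay split concrete, the following is a minimal sketch of nugget-level scoring under the binary split described in the abstract. It is an illustration under stated assumptions, not the official TREC scorer: the function names, the example nugget identifiers, and the default `beta` are hypothetical, and precision is taken as a given input rather than modeled here.

```python
def nugget_recall(returned, vital):
    """Recall over vital nuggets only; okay nuggets earn no recall credit."""
    vital = set(vital)
    if not vital:
        return 0.0
    return len(set(returned) & vital) / len(vital)


def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta is an assumption of this sketch; beta = 1 gives the plain
    harmonic mean, larger values weight recall more heavily.
    """
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical nugget identifiers for a single question.
vital = {"v1", "v2"}
okay = {"o1", "o2", "o3"}

# A response containing only okay nuggets has zero recall, hence zero F-score,
# even if its precision is perfect.
only_okay = f_score(1.0, nugget_recall({"o1", "o2"}, vital))

# A response containing one of the two vital nuggets.
one_vital = f_score(1.0, nugget_recall({"v1", "o1"}, vital))
```

Under these assumptions, any response that misses every vital nugget scores zero regardless of how many okay nuggets it contains, which is consistent with the zero median scores noted in the abstract.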