Paper: A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art

ACL ID: P13-2024
Title: A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
Venue: Annual Meeting of the Association for Computational Linguistics
Session: Short Paper
Year: 2013
Authors:

How good are automatic content metrics for news summary evaluation? Here we provide a detailed answer to this question, with a particular focus on assessing the ability of automatic evaluations to identify statistically significant differences present in manual evaluation of content. Using four years of data from the Text Analysis Conference, we analyze the performance of eight ROUGE variants in terms of accuracy, precision and recall in finding significantly different systems. Our experiments show that some of the neglected variants of ROUGE, based on higher-order n-grams and syntactic dependencies, are most accurate across the years; the commonly used ROUGE-1 scores find too many significant differences between systems which manual evaluation would deem comparable. We also test c...
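To make the evaluation setup concrete, the sketch below shows (a) a simple unigram-overlap computation in the spirit of ROUGE-1 and (b) how accuracy, precision, and recall of an automatic metric's significant-difference calls can be scored against manual evaluation over the same system pairs. This is an illustrative sketch only, not the paper's implementation; all function names and the parallel-boolean-list representation of significance decisions are assumptions for illustration.

```python
from collections import Counter

def rouge1_scores(candidate, reference):
    """Unigram precision, recall, and F1 of a candidate summary
    against a single reference (ROUGE-1-style overlap; the official
    ROUGE toolkit adds stemming, stopword options, etc.)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def significance_agreement(auto_sig, manual_sig):
    """Score an automatic metric's significant-difference decisions
    (auto_sig) against manual evaluation (manual_sig), both given as
    parallel lists of booleans, one entry per system pair."""
    tp = sum(a and m for a, m in zip(auto_sig, manual_sig))
    fp = sum(a and not m for a, m in zip(auto_sig, manual_sig))
    fn = sum(m and not a for a, m in zip(auto_sig, manual_sig))
    correct = sum(a == m for a, m in zip(auto_sig, manual_sig))
    accuracy = correct / len(auto_sig)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```

Under this scoring, a metric that declares too many system pairs significantly different (the behavior the abstract attributes to ROUGE-1) accumulates false positives and so loses precision even if its recall stays high.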