Paper: Corroborating Text Evaluation Results with Heterogeneous Measures

ACL ID D11-1042
Title Corroborating Text Evaluation Results with Heterogeneous Measures
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2011
Authors

Abstract
Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures proposed in recent years remains largely ignored in practice. In this paper we first present an in-depth analysis of the state of the art in order to clarify this issue. We then formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable), the high...
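The abstract makes two technical claims: that corroborating a system improvement across several measures increases evaluation reliability, and that the heterogeneity of a set of measures is itself measurable. The Python sketch below illustrates both ideas under stated assumptions; it is not the paper's method. The score matrix is invented, the unanimous-agreement rule is one simple way to operationalize "corroborating with additional measures", and the heterogeneity index (one minus the mean pairwise Kendall correlation of the system rankings the measures induce) is an assumed formula chosen only for illustration.

from itertools import combinations

def kendall_tau(x, y):
    # Kendall rank correlation between two equal-length score lists.
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

# Hypothetical scores: three measures scoring the same five systems.
scores = {
    "BLEU":   [0.31, 0.28, 0.35, 0.30, 0.33],
    "ROUGE":  [0.42, 0.40, 0.45, 0.41, 0.44],
    "METEOR": [0.55, 0.51, 0.54, 0.57, 0.56],
}

def corroborated_improvement(scores, sys_a, sys_b):
    # Count system a as an improvement over system b only when
    # every available measure agrees (unanimous corroboration).
    return all(m[sys_a] > m[sys_b] for m in scores.values())

def heterogeneity(scores):
    # One possible heterogeneity index: 1 minus the mean pairwise
    # Kendall correlation of the rankings the measures induce.
    taus = [kendall_tau(a, b) for a, b in combinations(scores.values(), 2)]
    return 1.0 - sum(taus) / len(taus)

print(corroborated_improvement(scores, 2, 1))   # True: all measures agree
print(round(heterogeneity(scores), 3))          # > 0: rankings partly disagree

Under this reading, a spurious "improvement" that happens to inflate one measure is unlikely to also pass a second, dissimilar measure, which is the intuition behind the abstract's claim that corroboration, especially with heterogeneous measures, raises the reliability of the evaluation.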