Paper: Comparing Automatic Evaluation Measures for Image Description

ACL ID P14-2074
Title Comparing Automatic Evaluation Measures for Image Description
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014
Authors

Image description is a new natural language generation task, where the aim is to generate a human-like description of an image. The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements. The focus of this paper is to determine the correlation of automatic measures with human judgements for this task. We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets. The main finding is that unigram BLEU has a weak correlation with human judgements, while Meteor has the strongest correlation.
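As a rough illustration of the simplest measure the paper evaluates, the following is a minimal sketch of unigram BLEU (clipped unigram precision times a brevity penalty), written from the standard BLEU definition rather than from this paper; the smoothed variants, TER, ROUGE-SU4, and Meteor all require additional machinery.

```python
from collections import Counter
import math

def unigram_bleu(candidate, references):
    """Unigram BLEU: clipped unigram precision times a brevity penalty.

    candidate: a string (one generated description).
    references: a list of strings (human reference descriptions).
    """
    cand = candidate.split()
    if not cand:
        return 0.0
    refs = [r.split() for r in references]

    # Clip each candidate unigram count by its maximum count in any reference.
    max_ref = Counter()
    for r in refs:
        for w, c in Counter(r).items():
            max_ref[w] = max(max_ref[w], c)
    cand_counts = Counter(cand)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty against the reference length closest to the candidate's
    # (ties broken toward the shorter reference, as in standard BLEU).
    r_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(cand)), l))
    bp = 1.0 if len(cand) >= r_len else math.exp(1 - r_len / len(cand))
    return bp * precision
```

For example, scoring the candidate "a dog runs on the grass" against the single reference "a dog is running on the grass" yields a clipped precision of 5/6 multiplied by a brevity penalty of exp(-1/6). Because unigram matching ignores word order and morphology ("runs" does not match "running"), scores like this correlate only weakly with human judgements, which is the paper's motivation for comparing richer measures such as Meteor.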