Paper: Assessing Dialog System User Simulation Evaluation Measures Using Human Judges

ACL ID P08-1071
Title Assessing Dialog System User Simulation Evaluation Measures Using Human Judges
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 2008
Authors

Previous studies evaluate simulated dialog corpora using evaluation measures which can be automatically extracted from the dialog systems’ logs. However, the validity of these automatic measures has not been fully proven. In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use human judgments as the gold standard to validate the conclusions drawn from the automatic measures. We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. However, the human ratings give consistent ranking of the quality of simulated corpora generated by different simulation models. When building prediction models of human judgments using previously proposed...
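
The abstract is truncated above, but as a rough illustration of the two analyses it describes (inter-judge agreement and ranking consistency between human and automatic evaluations), the following Python sketch shows one way such quantities could be computed. The judge ratings, model scores, and the choice of Cohen's kappa and Kendall's tau are illustrative assumptions, not the paper's reported setup or results.

    # Illustrative sketch only; data values and metric choices are assumed.
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import kendalltau

    # Hypothetical per-dialog quality ratings (1-5) from two human judges
    judge_a = [3, 4, 2, 5, 3, 4, 2, 3]
    judge_b = [2, 4, 3, 5, 3, 3, 2, 4]

    # Pairwise agreement between the two judges
    kappa = cohen_kappa_score(judge_a, judge_b)
    print(f"Pairwise judge agreement (Cohen's kappa): {kappa:.2f}")

    # Hypothetical corpus-level scores for three simulation models:
    # one from averaged human ratings, one from an automatic measure
    human_scores = [3.8, 2.9, 3.2]        # models A, B, C as rated by judges
    automatic_scores = [0.71, 0.52, 0.60]  # same models under an automatic measure

    # Do the human and automatic scores rank the models the same way?
    tau, p_value = kendalltau(human_scores, automatic_scores)
    print(f"Ranking consistency (Kendall's tau): {tau:.2f} (p={p_value:.2f})")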