Towards Best Experiment Design for Evaluating Dialogue System Output

Sashank Santhanam, Samira Shaikh


Abstract
To overcome the limitations of automated metrics (e.g., BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been shown that human judgments can suffer from inconsistent ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of those judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that continuous scales yield more consistent ratings than Likert-scale or ranking-based experiment designs. Additionally, we find that factors such as the time taken to complete the task and a lack of prior experience with similar rating studies positively impact consistency and agreement amongst raters.
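As a rough illustration of the Best-Worst scaling condition mentioned in the abstract, the sketch below (not taken from the authors' repository; the item ids and the `best_worst_scores` helper are hypothetical) shows the standard counting procedure for turning best/worst picks over tuples of responses into per-item scores, i.e. score = (#best − #worst) / #appearances.

```python
from collections import defaultdict

def best_worst_scores(annotations):
    """Compute Best-Worst scaling scores.

    annotations: list of dicts, one per rater judgment, with keys
      'tuple' (list of item ids shown), 'best' (chosen item id),
      and 'worst' (chosen item id).
    Returns a dict mapping item id -> score in [-1, 1].
    """
    best, worst, seen = defaultdict(int), defaultdict(int), defaultdict(int)
    for a in annotations:
        for item in a["tuple"]:
            seen[item] += 1
        best[a["best"]] += 1
        worst[a["worst"]] += 1
    # Items picked "best" more often than "worst" get higher scores.
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Toy usage with made-up dialogue responses r1..r4:
judgments = [
    {"tuple": ["r1", "r2", "r3", "r4"], "best": "r2", "worst": "r4"},
    {"tuple": ["r1", "r2", "r3", "r4"], "best": "r2", "worst": "r3"},
]
print(best_worst_scores(judgments))  # e.g. {'r1': 0.0, 'r2': 1.0, 'r3': -0.5, 'r4': -0.5}
```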
Anthology ID:
W19-8610
Volume:
Proceedings of the 12th International Conference on Natural Language Generation
Month:
October–November
Year:
2019
Address:
Tokyo, Japan
Editors:
Kees van Deemter, Chenghua Lin, Hiroya Takamura
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
88–94
URL:
https://aclanthology.org/W19-8610
DOI:
10.18653/v1/W19-8610
Cite (ACL):
Sashank Santhanam and Samira Shaikh. 2019. Towards Best Experiment Design for Evaluating Dialogue System Output. In Proceedings of the 12th International Conference on Natural Language Generation, pages 88–94, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
Towards Best Experiment Design for Evaluating Dialogue System Output (Santhanam & Shaikh, INLG 2019)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/W19-8610.pdf
Code:
sashank06/INLG_eval