Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis

Baber Khalid; Sungjin Lee

doi:10.18653/v1/2022.naacl-main.430

Abstract

Neural methods are increasingly used to evaluate dialogue models. Without a framework for investigating these metrics, dialogue models can inherit the metrics' biases and cause unforeseen problems during interactions. In this work, we propose an adversarial test suite that uses automatic heuristics to generate problematic variations of various dialogue aspects, e.g., logical entailment. By analyzing how metrics assess these problematic examples, we show that dialogue metrics for both open-domain and task-oriented settings are biased in their assessments of different conversation behaviors and fail to properly penalize problematic conversations. We conclude that variability in training methodologies and data-induced biases are some of the main causes of these problems. We also investigate the metric behaviors using a black-box interpretability model, which corroborates our findings and provides evidence that metrics pay attention to the problematic conversational constructs, signaling a misunderstanding of different conversation semantics.

Anthology ID: 2022.naacl-main.430
Volume: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 5871–5883
URL: https://aclanthology.org/2022.naacl-main.430
DOI: 10.18653/v1/2022.naacl-main.430
Cite (ACL): Baber Khalid and Sungjin Lee. 2022. Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5871–5883, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis (Khalid & Lee, NAACL 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-2/2022.naacl-main.430.pdf
Video: https://preview.aclanthology.org/nschneid-patch-2/2022.naacl-main.430.mp4
Data: MultiNLI
