Abstract
There is an increasing trend toward using neural methods for dialogue model evaluation. The lack of a framework for investigating these metrics can cause dialogue models to reflect the metrics' biases and lead to unforeseen problems during interactions. In this work, we propose an adversarial test suite that uses automatic heuristics to generate problematic variations of various dialogue aspects, e.g., logical entailment. By analyzing how metrics assess these problematic examples, we show that dialogue metrics for both open-domain and task-oriented settings are biased in their assessments of different conversation behaviors and fail to properly penalize problematic conversations. We conclude that variability in training methodologies and data-induced biases are among the main causes of these problems. We also investigate metric behavior using a black-box interpretability model, which corroborates our findings and provides evidence that metrics pay attention to the problematic conversational constructs, signaling a misunderstanding of different conversation semantics.
- Anthology ID:
- 2022.naacl-main.430
- Volume:
- Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month:
- July
- Year:
- 2022
- Address:
- Seattle, United States
- Editors:
- Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5871–5883
- URL:
- https://aclanthology.org/2022.naacl-main.430
- DOI:
- 10.18653/v1/2022.naacl-main.430
- Cite (ACL):
- Baber Khalid and Sungjin Lee. 2022. Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5871–5883, Seattle, United States. Association for Computational Linguistics.
- Cite (Informal):
- Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis (Khalid & Lee, NAACL 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2022.naacl-main.430.pdf
- Data
- MultiNLI