Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs
Abishek Komma, Nagesh Panyam Chandrasekarasastry, Timothy Leffel, Anuj Goyal, Angeliki Metallinou, Spyros Matsoukas, Aram Galstyan
Abstract
Measurement of interaction quality is a critical task for the improvement of large-scale spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately following an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model shows better domain generalization ability compared to the baselines. On the basis of these results, we argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
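To make the abstract's contrast concrete, below is a minimal, hypothetical sketch of the two estimation strategies it compares: aggregating per-turn quality scores into a dialog score versus training a supervised classifier on dialog-level labels with dialog-level attributes as features. The data structures (`Turn`, `Dialog`), the feature set, and the choice of logistic regression are illustrative assumptions, not details drawn from the paper.

```python
# Hypothetical sketch: turn-level aggregation baseline vs. a supervised
# dialog-level quality model. Names and features are illustrative only.
from dataclasses import dataclass
from statistics import mean
from typing import List

from sklearn.linear_model import LogisticRegression


@dataclass
class Turn:
    quality_score: float  # per-turn quality estimate in [0, 1]


@dataclass
class Dialog:
    turns: List[Turn]
    goal_completed: bool   # dialog-level attribute, as in DQA annotations
    user_sentiment: float  # e.g., -1 (negative) to +1 (positive)


def turn_aggregation_baseline(dialog: Dialog) -> float:
    """Baseline: average per-turn scores into a dialog-level score."""
    return mean(t.quality_score for t in dialog.turns)


def dialog_level_features(dialog: Dialog) -> List[float]:
    """Dialog-level features: aggregated turn signal plus dialog attributes."""
    return [
        turn_aggregation_baseline(dialog),
        float(dialog.goal_completed),
        dialog.user_sentiment,
        float(len(dialog.turns)),
    ]


# Tiny synthetic training set; labels are dialog-level quality annotations
# (1 = good dialog, 0 = bad dialog).
dialogs = [
    Dialog([Turn(0.9), Turn(0.8)], goal_completed=True, user_sentiment=0.7),
    Dialog([Turn(0.9), Turn(0.2)], goal_completed=False, user_sentiment=-0.5),
    Dialog([Turn(0.4), Turn(0.6), Turn(0.7)], goal_completed=True, user_sentiment=0.2),
    Dialog([Turn(0.3), Turn(0.3)], goal_completed=False, user_sentiment=-0.8),
]
labels = [1, 0, 1, 0]

model = LogisticRegression().fit(
    [dialog_level_features(d) for d in dialogs], labels
)
print(model.predict_proba([dialog_level_features(dialogs[0])])[0, 1])
```

The point of the sketch is that the supervised model can learn, for example, that a dialog with high average turn scores but a failed goal is still a bad dialog, which a pure turn-level average cannot capture.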
- Anthology ID: 2023.acl-industry.19
- Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Sunayana Sitaram, Beata Beigman Klebanov, Jason D Williams
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 186–195
- URL: https://aclanthology.org/2023.acl-industry.19
- DOI: 10.18653/v1/2023.acl-industry.19
- Cite (ACL): Abishek Komma, Nagesh Panyam Chandrasekarasastry, Timothy Leffel, Anuj Goyal, Angeliki Metallinou, Spyros Matsoukas, and Aram Galstyan. 2023. Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 186–195, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs (Komma et al., ACL 2023)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2023.acl-industry.19.pdf