Combining Multiple Metrics for Evaluating Retrieval-Augmented Conversations

Jason Ingyu Choi, Marcus Collins, Eugene Agichtein, Oleg Rokhlenko, Shervin Malmasi


Abstract
Conversational AI is a subtype of Human-Computer Interaction that has gained wide adoption. These systems are typically powered by Large Language Models (LLMs) that use Retrieval-Augmented Generation (RAG) to infuse external knowledge, which is effective against issues like hallucination. However, automatically evaluating retrieval-augmented conversations with minimal human effort remains challenging, particularly in online settings. We address this challenge by proposing a lexical metric and a novel method for combining it with other metrics, including semantic models. Our approach involves: (1) Conversational Information Utility (CIU), a new automated metric inspired by prior user studies on web search evaluation, which computes information overlap between conversation context and grounded information in an unsupervised, purely lexical way; and (2) a generalized reward model through Mixture-of-Experts (MoE-CIU) that dynamically ensembles CIU with other metrics, including learned ones, into a single reward. Evaluation against human ratings on two public datasets (Topical Chat and Persona Chat) shows that CIU improves correlation against human judgments by 2.0% and 0.9% respectively compared to the second best metric. When MoE is applied to combine lexical and learned semantic metrics, correlations further improve by 9.9% and 5.0%, suggesting that unified reward models are a promising approach.
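The two ideas named in the abstract, a lexical overlap score and a gated combination of several metric scores, can be illustrated with a minimal sketch. This is not the paper's CIU formula or its MoE-CIU model, neither of which is specified in the abstract. The function names (lexical_overlap, combine), the token-set overlap definition, and the fixed gate logits are all assumptions made for illustration; in the paper the ensembling is described as dynamic, which suggests the gate weights would be predicted per input rather than passed in by hand.

```python
import math
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_overlap(response, knowledge):
    """Hypothetical stand-in for a lexical metric (NOT the paper's CIU):
    the fraction of knowledge tokens that also appear in the response."""
    knowledge_tokens = tokens(knowledge)
    if not knowledge_tokens:
        return 0.0
    return len(tokens(response) & knowledge_tokens) / len(knowledge_tokens)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine(scores, gate_logits):
    """Mixture-style combination: a softmax-weighted sum of metric scores.
    gate_logits are supplied by the caller here (an assumption); a learned
    gate would produce them from the conversation itself."""
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    lexical = lexical_overlap(
        "Paris is the capital of France",
        "The capital of France is Paris",
    )
    semantic = 0.8  # placeholder score from some learned metric (assumed)
    print(combine([lexical, semantic], gate_logits=[0.0, 0.0]))
```

With equal gate logits the combination reduces to a plain average of the metric scores; unequal logits shift the reward toward whichever metric the gate trusts more for that input.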
Anthology ID:
2024.hcinlp-1.4
Volume:
Proceedings of the Third Workshop on Bridging Human–Computer Interaction and Natural Language Processing
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Su Lin Blodgett, Amanda Cercas Curry, Sunipa Dev, Michael Madaio, Ani Nenkova, Diyi Yang, Ziang Xiao
Venues:
HCINLP | WS
Publisher:
Association for Computational Linguistics
Pages:
40–50
URL:
https://aclanthology.org/2024.hcinlp-1.4
DOI:
10.18653/v1/2024.hcinlp-1.4
Cite (ACL):
Jason Ingyu Choi, Marcus Collins, Eugene Agichtein, Oleg Rokhlenko, and Shervin Malmasi. 2024. Combining Multiple Metrics for Evaluating Retrieval-Augmented Conversations. In Proceedings of the Third Workshop on Bridging Human–Computer Interaction and Natural Language Processing, pages 40–50, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Combining Multiple Metrics for Evaluating Retrieval-Augmented Conversations (Choi et al., HCINLP-WS 2024)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2024.hcinlp-1.4.pdf