BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues

Haodong Duan; Jueqi Wei; Chonghua Wang; Hongwei Liu; Yixiao Fang; Songyang Zhang; Dahua Lin; Kai Chen

Abstract

In the realm of modern Large Language Models (LLMs), facilitating high-quality, multi-turn dialogues with humans represents a cornerstone feature. However, human-based evaluation of such a capability involves substantial manual effort. This study offers a formative assessment of current LLMs’ proficiency in emulating human-like, multi-turn conversations using an LLM-centric approach. The evaluation pipeline comprises three key elements: utterance generation, evaluation protocol, and judgement, and we delve deeply into each aspect. GPT-4, both as an utterance generator and as a judge, exhibits exceptional performance. As a generator, GPT-4 crafts dialogues indistinguishable from human interactions in terms of style and flow. When judging, it shows a heightened alignment with human evaluative standards and consistency. Conversely, other LLMs face challenges in producing quality multi-turn dialogues, hindered by inadequate instruction-following abilities, a propensity for prolix utterances, and overall limited capabilities. Notably, generating extensive dialogues (e.g., spanning tens of turns) remains a formidable task for most LLMs, particularly in Chinese contexts. We hope that our work can serve as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs. Related resources are available at https://github.com/open-compass/BotChat.

Anthology ID: 2024.findings-naacl.201
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3184–3200
URL: https://aclanthology.org/2024.findings-naacl.201
Cite (ACL): Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, and Kai Chen. 2024. BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3184–3200, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues (Duan et al., Findings 2024)
PDF: https://preview.aclanthology.org/naacl24-info/2024.findings-naacl.201.pdf