Simple Agents, Biased Judges: Efficient Multi-Party Dialogue Generation & The Evaluation Gap

Kunal Samanta; Faisal Tareque Shohan; Amine Trabelsi; Richard Khoury

Abstract

Multi-party social dialogue remains underexplored in the literature, in part due to the difficulty and cost of evaluation. As a result, recent work on synthetic dialogue generation often relies on automated metrics and LLM-as-a-Judge frameworks, despite limited evidence that such judges reflect human preferences in social settings. In this work, we introduce a lightweight and controllable multi-party dialogue generation framework (MPOD) as an experimental instrument for studying generation and evaluation in social interaction. Using this framework, we conduct human evaluations of open-domain multi-party dialogue simulation and directly compare human judgments against state-of-the-art LLM judges. Across 319 pairwise comparisons, we observe near-random agreement between humans and automated judges (Cohen’s 𝜅 ≈ 0.11), driven by systematic behaviors including extreme tie aversion and strong sensitivity to assistant-style verbosity. Crucially, human–human inter-annotator agreement (𝜅 = 0.29) is substantially higher than human–LLM agreement. To isolate the mechanism underlying this misalignment, we introduce a controlled Transplant Ablation, showing that LLM judges consistently prefer conversations containing a single proprietary, assistant-style agent. Additional stress tests show that judges prefer GPT-style conversations even when utterance order is randomly shuffled, indicating insensitivity to conversational structure and coherence. Our findings provide controlled evidence that current instruction-tuned LLM judges do not reliably reflect human preferences for naturalness, engagingness, and overall quality in multi-party social dialogue, calling into question their widespread use for validating synthetic conversational data.
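For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's 𝜅 can be computed over pairwise-comparison labels. It is not the authors' evaluation code: the function name `cohens_kappa`, the three-way label set ("A", "B", "tie"), and the toy label lists are all assumptions made for illustration, and the numbers are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (NOT data from the paper): the "judge" never
# answers "tie", loosely mimicking the tie aversion described above.
human = ["A", "tie", "B", "tie", "A", "B", "tie", "A"]
judge = ["A", "A", "B", "B", "A", "A", "B", "B"]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

Because 𝜅 subtracts the agreement expected by chance from the observed agreement, a rater that never uses one of the labels (here "tie") can match a fair share of items and still score close to zero.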

Anthology ID: 2026.acl-long.2006
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 43326–43345
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2006/
Cite (ACL): Kunal Samanta, Faisal Tareque Shohan, Amine Trabelsi, and Richard Khoury. 2026. Simple Agents, Biased Judges: Efficient Multi-Party Dialogue Generation & The Evaluation Gap. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 43326–43345, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Simple Agents, Biased Judges: Efficient Multi-Party Dialogue Generation & The Evaluation Gap (Samanta et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2006.pdf
Checklist: 2026.acl-long.2006.checklist.pdf
