Kunal Samanta
2026
Simple Agents, Biased Judges: Efficient Multi-Party Dialogue Generation & The Evaluation Gap
Kunal Samanta | Faisal Tareque Shohan | Amine Trabelsi | Richard Khoury
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-party social dialogue remains underexplored in the literature, in part due to the difficulty and cost of evaluation. As a result, recent work on synthetic dialogue generation often relies on automated metrics and LLM-as-a-Judge frameworks, despite limited evidence that such judges reflect human preferences in social settings. In this work, we introduce a lightweight and controllable multi-party dialogue generation framework (MPOD) as an experimental instrument for studying generation and evaluation in social interaction. Using this framework, we conduct human evaluations of open-domain multi-party dialogue simulation and directly compare human judgments against state-of-the-art LLM judges. Across 319 pairwise comparisons, we observe near-random agreement between humans and automated judges (Cohen’s κ ≈ 0.11), driven by systematic behaviors including extreme tie aversion and strong sensitivity to assistant-style verbosity. Crucially, human–human inter-annotator agreement (κ = 0.29) is substantially higher than human–LLM agreement. To isolate the mechanism underlying this misalignment, we introduce a controlled Transplant Ablation, showing that LLM judges consistently prefer conversations containing a single proprietary, assistant-style agent. Additional stress tests show that judges prefer GPT-style conversations even when utterance order is randomly shuffled, indicating insensitivity to conversational structure and coherence. Our findings provide controlled evidence that current instruction-tuned LLM judges do not reliably reflect human preferences for naturalness, engagingness, and overall quality in multi-party social dialogue, calling into question their widespread use for validating synthetic conversational data.
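For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's κ can be computed for two raters who label the same pairwise comparisons. The function name `cohens_kappa`, the label set (`"A"`, `"B"`, `"tie"`), and the toy label lists are assumptions made for illustration; they are not the paper's data or evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, illustrative only (not taken from the paper).
# The "judge" list never uses "tie", mimicking a tie-averse rater.
human = ["A", "tie", "B", "tie", "A", "B", "tie", "A"]
judge = ["A", "A", "B", "B", "B", "A", "A", "A"]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 3 of 8 items, but most of that agreement is expected by chance given their label frequencies, so κ stays close to zero. A rater that never outputs "tie" cannot match any of the human "tie" labels, which is one way tie aversion can depress agreement.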