CourtEval: A Courtroom-Based Multi-Agent Evaluation Framework

Sandeep Kumar, Abhijit A Nargund, Vivek Sridhar


Abstract
Automated evaluation is crucial for assessing the quality of natural language text, especially in open-ended generation tasks, given the costly and time-consuming nature of human evaluation. Existing automatic evaluation metrics such as ROUGE and BLEU often show low correlation with human judgments. As large language models (LLMs) continue to evolve, researchers have explored their use as alternatives to human evaluators. Although single-agent approaches have shown potential, results indicate that further progress is needed to close the gap between their performance and the quality of human assessments. Because human evaluation typically involves multiple annotators, a multi-agent approach allows LLMs to collaborate, improving efficiency and effectiveness on complex tasks. In this paper, we present CourtEval, a novel Multi-Agent Evaluation Framework modeled after courtroom dynamics. Each agent takes on a distinct role: the Grader, similar to a judge, assigns an initial score; the Critic, like a prosecutor, challenges this score; and the Defender, akin to a defense attorney, defends it. Based on the input from both the Critic and the Defender, the Grader re-evaluates the score, leading to a more balanced and fair final decision through this adversarial process. CourtEval substantially outperforms previous state-of-the-art methods on two meta-evaluation benchmarks for NLG evaluation, SummEval and TopicalChat.
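For illustration only, the sketch below shows how the courtroom loop described in the abstract could be wired up. The call_llm helper, the prompt wording, and the single-round default are assumptions made for this example and are not taken from the paper.

# Hypothetical sketch of the Grader/Critic/Defender loop described above.
# call_llm is a placeholder for any chat-completion backend; prompts, role
# wording, and round count are illustrative assumptions, not the authors' setup.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder: send a prompt pair to an LLM and return its reply."""
    raise NotImplementedError("plug in your preferred LLM client here")

def court_eval(source: str, candidate: str, criterion: str, rounds: int = 1) -> str:
    # Grader (judge): assign an initial score with a short rationale.
    verdict = call_llm(
        "You are the Grader. Score the candidate on the given criterion (1-5) "
        "and justify the score.",
        f"Criterion: {criterion}\nSource: {source}\nCandidate: {candidate}",
    )
    for _ in range(rounds):
        # Critic (prosecutor): challenge the current score.
        critique = call_llm(
            "You are the Critic. Argue against the Grader's score by pointing "
            "out weaknesses it overlooks.",
            f"Verdict: {verdict}\nCandidate: {candidate}",
        )
        # Defender (defense attorney): defend the current score.
        defense = call_llm(
            "You are the Defender. Rebut the Critic and defend the Grader's "
            "score where it is justified.",
            f"Verdict: {verdict}\nCritique: {critique}",
        )
        # Grader re-evaluates in light of both arguments.
        verdict = call_llm(
            "You are the Grader. Given the Critic's and Defender's arguments, "
            "re-evaluate and output a final 1-5 score with a brief justification.",
            f"Previous verdict: {verdict}\nCritic: {critique}\nDefender: {defense}",
        )
    return verdict

In this reading, the adversarial exchange supplies the Grader with explicit counter-arguments before it commits to a final score, which is the mechanism the abstract credits for producing more balanced decisions.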
Anthology ID: 2025.findings-acl.1327
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues: Findings | WS
Publisher: Association for Computational Linguistics
Pages: 25875–25887
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1327/
Cite (ACL): Sandeep Kumar, Abhijit A Nargund, and Vivek Sridhar. 2025. CourtEval: A Courtroom-Based Multi-Agent Evaluation Framework. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25875–25887, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): CourtEval: A Courtroom-Based Multi-Agent Evaluation Framework (Kumar et al., Findings 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1327.pdf