Concord: An Agreement-Aware Multi-Adjudication Pipeline for LLM Evaluation
Tyler Bliss, Mahit Verma, Aila Iyer-Singh, Subrata Biswas, Sheikh Asif Imran, Bashima Islam
Abstract
Evaluating multimodal generations is challenging: human evaluation is costly, and single-model LLM-as-a-judge pipelines can be brittle and provide limited uncertainty signals. We introduce Concord, an ensemble-based evaluation pipeline that aggregates discrete judgments from multiple LLM judges and uses inter-judge agreement as a practical uncertainty signal for disagreement-driven triage. We evaluate Concord on AVSSD and SCORE-AVS, a ground-truth-supervised audio-visual benchmark with discrete labels (True/False or 0–5). Concord improves agreement with human judgments over single-judge and naive aggregation baselines, and prioritizing low-agreement instances focuses human review on the most ambiguous cases. We use locally hosted open-source judges and also report binary results for the larger hosted models GPT4.o mini turbo and Gemini 3.1 Flash Lite.
- Anthology ID:
- 2026.gem-main.46
- Volume:
- Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
- Venues:
- GEM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 502–510
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.46/
- Cite (ACL):
- Tyler Bliss, Mahit Verma, Aila Iyer-Singh, Subrata Biswas, Sheikh Asif Imran, and Bashima Islam. 2026. Concord: An Agreement-Aware Multi-Adjudication Pipeline for LLM Evaluation. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 502–510, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Concord: An Agreement-Aware Multi-Adjudication Pipeline for LLM Evaluation (Bliss et al., GEM 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.46.pdf
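
The abstract describes aggregating discrete judgments from several judges and using inter-judge agreement to decide which instances go to human review. A minimal sketch of that idea follows. It is not the authors' implementation: the abstract does not specify the aggregation rule, so plain majority voting is assumed here, and the names `aggregate`, `triage`, the `threshold` value, and the example clip ids are all hypothetical.

```python
from collections import Counter


def aggregate(labels):
    """Return (majority_label, agreement) for one item's judge labels.

    Assumption: simple majority vote. Agreement is the fraction of
    judges that voted for the winning label.
    """
    counts = Counter(labels)
    # On a tie, most_common keeps the label that was seen first.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def triage(judgments, threshold=0.75):
    """Return ids of items with agreement below threshold, least agreed first."""
    scored = {item: aggregate(labels) for item, labels in judgments.items()}
    flagged = [item for item, (_, agree) in scored.items() if agree < threshold]
    return sorted(flagged, key=lambda item: scored[item][1])


# Hypothetical binary judgments from four judges on three items.
judgments = {
    "clip_a": [True, True, True, True],
    "clip_b": [True, False, True, False],
    "clip_c": [False, False, True, False],
}

for item, labels in judgments.items():
    print(item, aggregate(labels))
print("send to human review:", triage(judgments))
```

The same functions work unchanged for the 0–5 label setting, since `Counter` only requires hashable labels; a weighted or ordinal-aware agreement measure would be a natural substitute for the raw vote fraction.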