Tyler Bliss


2026

Evaluating multimodal generations is challenging: human evaluation is costly, and single-model LLM-as-a-judge pipelines can be brittle and provide limited uncertainty signals. We introduce Concord, an ensemble-based evaluation pipeline that aggregates discrete judgments from multiple LLM judges and uses inter-judge agreement as a practical uncertainty signal for disagreement-driven triage. We evaluate Concord on AVSSD and SCORE-AVS, a ground-truth-supervised audio-visual benchmark with discrete labels (True/False or 0–5). Concord improves agreement with human judgments over single-judge and naive aggregation baselines, and prioritizing low-agreement instances focuses human review on the most ambiguous cases. Our judges are locally hosted open-source models; we additionally report binary-label results for two larger models hosted online, GPT4.o mini turbo and Gemini 3.1 Flash Lite.
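To make the idea of agreement-based triage concrete, the following is a minimal sketch rather than Concord's actual method. The abstract does not specify the aggregation rule, so plurality voting is used here purely as a stand-in, and the function names `aggregate` and `triage` and the 0.75 threshold are hypothetical. Agreement is taken to be the fraction of judges that voted for the plurality label.

```python
from collections import Counter


def aggregate(votes):
    """Return (plurality_label, agreement) for one item's discrete judge votes.

    Assumption: plurality vote stands in for the aggregation rule, and
    agreement is the share of judges that chose the winning label.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)


def triage(items, threshold=0.75):
    """Return ids whose agreement is below `threshold`, least agreed first.

    `items` maps an item id to the list of labels given by the judges.
    The threshold value is illustrative, not taken from the paper.
    """
    scored = {item_id: aggregate(votes) for item_id, votes in items.items()}
    flagged = [item_id for item_id, (_, agree) in scored.items() if agree < threshold]
    return sorted(flagged, key=lambda item_id: scored[item_id][1])


if __name__ == "__main__":
    judgments = {
        "a": [True, True, True],          # unanimous, not flagged
        "b": [True, False, True],         # agreement 2/3
        "c": [True, False, False, True],  # agreement 1/2
    }
    print(triage(judgments))
```

The same functions apply unchanged to 0–5 labels, since they only require the votes to be hashable; a real pipeline might instead weight judges or use a distance-aware agreement measure for ordinal scores.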