The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators

Tianruo Rose Xu, Vedant Gaur, Liu Leqi, Tanya Goyal


Abstract
LLM judges have gained popularity as an inexpensive and performant substitute for human evaluation. However, we observe that the meta-evaluation setting in which the reliability of these LLM evaluators is established differs substantially from how they are used in model development. To address this, we revisit meta-evaluation of LLM evaluators under a setting that more closely aligns with practice, examining evaluators’ ability to distinguish test system pairs that are closer in capability. Our fine-grained approach shows that all LLM evaluators’ correlations with human judgments are concerningly low when the models perform similarly, exposing a key limitation of current norms. Equipped with this improved methodology, we then analyze the impact of reference model choice on LLM-as-a-judge evaluator performance. We show that single-reference evaluators rank test systems well only within particular capability ranges, even when standard meta-evaluation reports high overall correlation. Taken together, our analysis reveals critical issues with current LLM meta-evaluation and recommends avenues for improvement.
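To illustrate the meta-evaluation distinction the abstract draws, the sketch below (not the paper’s code; all system names and scores are hypothetical) computes a judge’s pairwise ranking agreement with human judgments, both over all system pairs and restricted to pairs that are close in capability:

```python
# Illustrative sketch: pairwise agreement between an LLM judge's system
# ranking and a human ranking, optionally restricted to close-capability pairs.
from itertools import combinations

def pairwise_agreement(human, judge, max_gap=None):
    """Fraction of system pairs the judge orders the same way as humans.

    human, judge: dicts mapping system name -> score.
    max_gap: if set, count only pairs whose human-score gap is <= max_gap,
    approximating the paper's focus on systems closer in capability.
    """
    agree = total = 0
    for a, b in combinations(human, 2):
        if max_gap is not None and abs(human[a] - human[b]) > max_gap:
            continue
        total += 1
        if (human[a] - human[b]) * (judge[a] - judge[b]) > 0:
            agree += 1
    return agree / total if total else float("nan")

# Hypothetical scores for five test systems (0-100 scale).
human = {"A": 90, "B": 85, "C": 70, "D": 68, "E": 40}
judge = {"A": 88, "B": 89, "C": 72, "D": 65, "E": 45}

print(pairwise_agreement(human, judge))             # all pairs -> 0.9
print(pairwise_agreement(human, judge, max_gap=5))  # close pairs -> 0.5
```

In this toy example, agreement is high overall but drops sharply once only close-capability pairs are counted, which is the pattern the paper's fine-grained analysis surfaces.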
Anthology ID:
2025.findings-emnlp.1036
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
19033–19043
URL:
https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.findings-emnlp.1036/
Cite (ACL):
Tianruo Rose Xu, Vedant Gaur, Liu Leqi, and Tanya Goyal. 2025. The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19033–19043, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators (Xu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.findings-emnlp.1036.pdf
Checklist:
2025.findings-emnlp.1036.checklist.pdf