Using LLM Judgements for Sanity Checking Results and Reproducibility of Human Evaluations in NLP

Rudali Huidrom; Anya Belz

Abstract

Human-like evaluation by LLMs of NLP systems is currently attracting a lot of interest, and correlations with human reference evaluations are often remarkably strong. However, this is not always the case, for reasons that remain unclear, which means that without also meta-evaluating against human evaluations (incurring the very cost automatic evaluation is intended to avoid), we don’t know whether an LLM-as-judge evaluation is reliable. In this paper, we explore a type of evaluation scenario where this may not matter, because it comes with a built-in reliability check. We apply different LLM-as-judge methods to sets of three comparable human evaluations: (i) an original human evaluation, and (ii) two reproductions of it which produce contradictory reproducibility results. We find that in each case, the different LLM-as-judge methods (i) strongly agree with each other, and (ii) strongly agree with the results of one reproduction, while strongly disagreeing with the other. In combination, we take this to mean that a set of LLMs can be used to sanity check contradictory reproducibility results, provided that the LLMs agree with each other and that both their agreement with one set of results and their disagreement with the other are strong.
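The decision rule described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how agreement is measured, so Pearson correlation over per-system scores, the function names `pearson` and `sanity_check`, the thresholds `agree` and `disagree`, and all score values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def sanity_check(llm_scores, repro_a, repro_b, agree=0.8, disagree=0.0):
    """Return "A" or "B" for the reproduction the LLM judges support,
    or None when the conditions for a verdict are not met.

    The thresholds are assumed values chosen for illustration only.
    """
    judges = list(llm_scores.values())

    # Condition 1: the LLM judges must strongly agree with each other.
    if any(pearson(p, q) < agree for p, q in combinations(judges, 2)):
        return None

    corr_a = [pearson(j, repro_a) for j in judges]
    corr_b = [pearson(j, repro_b) for j in judges]

    # Condition 2: every judge strongly agrees with one reproduction
    # and strongly disagrees with the other.
    if min(corr_a) >= agree and max(corr_b) <= disagree:
        return "A"
    if min(corr_b) >= agree and max(corr_a) <= disagree:
        return "B"
    return None


# Invented toy data: one score per evaluated system.
repro_a = [3.1, 3.8, 2.5, 4.2]
repro_b = [3.9, 3.0, 4.1, 2.6]
llm_scores = {
    "judge_1": [3.0, 3.9, 2.4, 4.0],
    "judge_2": [3.2, 3.7, 2.7, 4.3],
}
verdict = sanity_check(llm_scores, repro_a, repro_b)
```

With this toy data both hypothetical judges track `repro_a` and run against `repro_b`, so the check returns `"A"`. If the judges disagreed with each other, or agreed only weakly with either reproduction, the function would return `None`, reflecting the abstract's point that the check only applies when all the agreement and disagreement signals are strong.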

Anthology ID: 2025.gem-1.30
Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month: July
Year: 2025
Address: Vienna, Austria and virtual meeting
Editors: Kaustubh Dhole, Miruna Clinciu
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 354–365
URL: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.30/
Cite (ACL): Rudali Huidrom and Anya Belz. 2025. Using LLM Judgements for Sanity Checking Results and Reproducibility of Human Evaluations in NLP. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 354–365, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Using LLM Judgements for Sanity Checking Results and Reproducibility of Human Evaluations in NLP (Huidrom & Belz, GEM 2025)
PDF: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.30.pdf