An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability

Yusuke Yamauchi; Taro Yano; Masafumi Oyamada

Abstract

As large language models (LLMs) continue to advance, reliable evaluation methods are essential—particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.

Anthology ID: 2026.gem-main.19
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 167–176
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.19/
Cite (ACL): Yusuke Yamauchi, Taro Yano, and Masafumi Oyamada. 2026. An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 167–176, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability (Yamauchi et al., GEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.19.pdf
