Mediocrity is the key for LLM as a Judge Anchor Selection

Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend


Abstract
The “LLM-as-a-judge” paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
Anthology ID:
2026.acl-long.706
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
15491–15513
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.706/
DOI:
Bibkey:
Cite (ACL):
Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, and Omri Abend. 2026. Mediocrity is the key for LLM as a Judge Anchor Selection. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15491–15513, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Mediocrity is the key for LLM as a Judge Anchor Selection (Don-Yehiya et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.706.pdf
Checklist:
 2026.acl-long.706.checklist.pdf