Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase


Abstract
This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior evaluation instruction-following capabilities; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy.
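The plan-then-judge idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the names `PLAN_PROMPT`, `JUDGE_PROMPT`, `parse_verdict`, and `plan_then_judge`, the pairwise A/B setup, and the choice to build the plan from the instruction alone are all assumptions made for illustration. The `generate` argument stands in for any function that sends a prompt to a model and returns its text.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt wording; the paper's actual prompts are not given in the abstract.
PLAN_PROMPT = (
    "You will later compare two responses to the instruction below.\n"
    "First, write a short evaluation plan: the criteria to check and their order.\n"
    "Do not judge anything yet.\n\n"
    "Instruction:\n{instruction}\n"
)

JUDGE_PROMPT = (
    "Follow this evaluation plan step by step:\n{plan}\n\n"
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "End with a line of the form 'Verdict: A' or 'Verdict: B'."
)


def parse_verdict(text: str) -> Optional[str]:
    """Return the last 'Verdict: A/B' label found in the text, or None."""
    matches = re.findall(r"Verdict:\s*([AB])\b", text)
    return matches[-1] if matches else None


def plan_then_judge(
    instruction: str,
    response_a: str,
    response_b: str,
    generate: Callable[[str], str],
) -> Optional[str]:
    """Two-stage judgment: ask for an explicit plan, then judge with that plan."""
    # Stage 1: the model writes an evaluation plan before seeing the responses
    # (an assumption of this sketch, not a detail stated in the abstract).
    plan = generate(PLAN_PROMPT.format(instruction=instruction))
    # Stage 2: the model judges while conditioned on its own plan.
    judgment = generate(
        JUDGE_PROMPT.format(
            plan=plan,
            instruction=instruction,
            response_a=response_a,
            response_b=response_b,
        )
    )
    return parse_verdict(judgment)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a caller would wrap whichever client they use in a one-argument function.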
Anthology ID:
2026.evaleval-1.13
Volume:
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Month:
July
Year:
2026
Address:
San Diego, CA
Editors:
Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat, Michelle Lin
Venues:
EvalEval | WS
Publisher:
Association for Computational Linguistics
Pages:
70–81
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.13/
Cite (ACL):
Hui Huang, Xuanxin Wu, Muyun Yang, and Yuki Arase. 2026. Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases. In Proceedings of the Workshop on Evaluating Evaluations (EvalEval), pages 70–81, San Diego, CA. Association for Computational Linguistics.
Cite (Informal):
Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases (Huang et al., EvalEval 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.13.pdf