LRBench and Judge-R1: Principled Evaluation and Training of LLM-Based Judges for Long-Context Reasoning

Xinyi Zhao, Haoqi Hu, Ziyu Wang, Jinfeng Xiao


Abstract
Large language models (LLMs) are increasingly used as judges to evaluate, rank, and supervise other models, yet their reliability in judging LLMs’ reasoning processes in long-context settings remains underexplored. Existing benchmarks either rely too heavily on human annotators, who may miss subtle flaws in lengthy reasoning chains, or focus solely on final responses while ignoring the underlying context and reasoning process. We introduce Long-Reason Bench (LRBench), a large-scale benchmark for evaluating LLM-based judges. LRBench comprises over 100K annotated instances spanning medical, legal, and academic-review scenarios, with fine-grained labels indicating violations of six core principles: Logical Correctness, Factual Consistency, Bias and Fairness, Groundedness, Helpfulness, and Harmlessness. Experimental results reveal that state-of-the-art LLM judges struggle to identify nuanced reasoning errors in long contexts. To improve judge reliability, we further present Judge-R1, which combines reinforcement learning with multi-turn search to enable grounded and principle-aware evaluation. Across domains and principles, Judge-R1 consistently outperforms single-turn baselines, enabling scalable and trustworthy evaluation of LLM reasoning. Our dataset and code are available at https://github.com/Xinyi-0724/Judge-R1.
Anthology ID:
2026.findings-acl.2029
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
40839–40861
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2029/
Cite (ACL):
Xinyi Zhao, Haoqi Hu, Ziyu Wang, and Jinfeng Xiao. 2026. LRBench and Judge-R1: Principled Evaluation and Training of LLM-Based Judges for Long-Context Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 40839–40861, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
LRBench and Judge-R1: Principled Evaluation and Training of LLM-Based Judges for Long-Context Reasoning (Zhao et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2029.pdf
Checklist:
 2026.findings-acl.2029.checklist.pdf