RubricBench: Aligning Model-Generated Rubrics with Human Standards
Junyi Zhou, Qiyuan Zhang, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma
Abstract
As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.- Anthology ID:
- 2026.acl-long.1439
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 31179–31200
- Language:
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1439/
- DOI:
- Cite (ACL):
- Junyi Zhou, Qiyuan Zhang, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, and Chen Ma. 2026. RubricBench: Aligning Model-Generated Rubrics with Human Standards. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31179–31200, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- RubricBench: Aligning Model-Generated Rubrics with Human Standards (Zhou et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1439.pdf