CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Xinyu Hu; Yancheng He; Weixun Wang; Tao Feng; Li Lin; Jiashun Liu; Wenbo Su; Bo Zheng; Xiaojun Wan

Abstract

Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, recent LLM-as-a-Judge paradigms enable better and more flexible evaluation and show promise as generative reward models for reinforcement learning (RL). However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and their actual effectiveness in RL practice. We attribute this gap to limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. We therefore propose **CE-RM-4B**, a pointwise generative reward model that is trained with a dedicated two-stage rollout method and adopts unified query-based criteria. Using only about 5.7K high-quality examples curated from an open-source preference dataset, CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.

Anthology ID: 2026.findings-acl.982
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19629–19642
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.982/
Cite (ACL): Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, and Xiaojun Wan. 2026. CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19629–19642, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria (Hu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.982.pdf
Checklist: 2026.findings-acl.982.checklist.pdf
