Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, Yingchun Wang
Abstract
Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, most current methods rely on rule-based evaluations of answer correctness, overlooking the importance of confidence-aware reasoning, especially for small to medium-sized models. These models often receive rewards for speculative answers without generating coherent reasoning chains. To address this limitation, we propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our code and model at https://github.com/qianxiHe147/C2RM.
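The abstract does not spell out the exact scoring rule, so the following is a minimal illustrative sketch of the core idea only: a reward that penalizes wrong answers and also discounts correct answers given with low confidence, plus a simple Best-of-N selector that picks the candidate with the highest reward-model score. The confidence scalar, threshold, and penalty values are assumptions for illustration and are not taken from the released C2RM implementation.

```python
# Illustrative sketch of a confidence-aware reward (not the authors' code).
# Assumptions: correctness is available from a rule-based checker at training
# time, and `confidence` is a scalar in [0, 1]; threshold/discount are made up.
from typing import List


def confidence_aware_reward(is_correct: bool, confidence: float,
                            threshold: float = 0.5,
                            low_conf_discount: float = 0.5) -> float:
    """Score one candidate answer given its correctness and confidence."""
    if not is_correct:
        return -1.0                     # incorrect answer: penalized
    if confidence < threshold:
        return 1.0 - low_conf_discount  # correct but speculative: partial credit
    return 1.0                          # correct and confident: full reward


def best_of_n(scores: List[float]) -> int:
    """Best-of-N selection: return the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Three sampled answers scored at training time: (is_correct, confidence)
    samples = [(True, 0.3), (False, 0.9), (True, 0.8)]
    scores = [confidence_aware_reward(ok, conf) for ok, conf in samples]
    print(best_of_n(scores))  # -> 2: the confident, correct answer wins
```

In the paper's pipeline this kind of scalar score would be produced by the learned reward model itself (and fed to PPO during RL training); the rule above only illustrates why a low-confidence correct answer should earn less reward than a confident one.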
- Anthology ID:
- 2025.emnlp-main.1385
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 27215–27231
- URL:
- https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.emnlp-main.1385/
- Cite (ACL):
- Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, and Yingchun Wang. 2025. Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27215–27231, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning (He et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.emnlp-main.1385.pdf