Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning

Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, Yingchun Wang


Abstract
Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, most current methods rely on rule-based evaluations of answer correctness, overlooking the importance of confidence-aware reasoning, especially for small to medium-sized models. These models often receive rewards for speculative answers without generating coherent reasoning chains. To address this limitation, we propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our code and model at https://github.com/qianxiHe147/C2RM.
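To make the reward design the abstract describes concrete, here is a minimal illustrative sketch of a reward that depends on both answer correctness and the model's confidence, penalizing incorrect answers as well as low-confidence correct ones. The function name, confidence threshold, and penalty magnitudes are assumptions for illustration only, not the released C2RM implementation.

```python
def confidence_aware_reward(is_correct: bool, confidence: float,
                            threshold: float = 0.7) -> float:
    """Toy confidence-aware reward (illustrative values, not the paper's).

    - Incorrect answer: full penalty, regardless of confidence.
    - Correct but low-confidence answer: partial penalty, discouraging
      speculative guesses that happen to be right.
    - Correct and confident answer: full reward.
    """
    if not is_correct:
        return -1.0
    if confidence < threshold:
        return -0.5
    return 1.0


# Example: a correct but unconfident answer is still penalized.
print(confidence_aware_reward(is_correct=True, confidence=0.4))   # -0.5
print(confidence_aware_reward(is_correct=True, confidence=0.9))   #  1.0
```

Such a scalar could, under these assumptions, be plugged into a Best-of-N reranker or a PPO reward signal in place of a purely rule-based correctness check.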
Anthology ID:
2025.emnlp-main.1385
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
27215–27231
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.1385/
DOI:
10.18653/v1/2025.emnlp-main.1385
Cite (ACL):
Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, and Yingchun Wang. 2025. Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27215–27231, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning (He et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.1385.pdf
Checklist:
2025.emnlp-main.1385.checklist.pdf