Supervised Optimism Correction: Be Confident When LLMs Are Sure
Junjie Zhang | Rushuai Yang | Shunyu Liu | Ting-En Lin | Fei Huang | Yi Chen | Yongbin Li | Dacheng Tao
Findings of the Association for Computational Linguistics: ACL 2025
In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit Q-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated Q-value estimations of suboptimal steps. To address this limitation, we propose **S**upervised **O**ptimism **C**orrection (SOC), which introduces a simple yet effective auxiliary loss for token-level Q-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.
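As a rough illustration of how such an auxiliary term might be attached to standard supervised fine-tuning, the sketch below adds a margin-style penalty on the token logits viewed as implicit Q-values. The function name `soc_style_loss`, the margin formulation, and the `aux_weight` coefficient are assumptions for illustration, not the exact SOC objective from the paper.

```python
# Illustrative sketch only: SFT cross-entropy plus a hypothetical margin-style
# auxiliary term on token logits read as implicit Q-values. Not the paper's
# exact loss; the margin form and aux_weight are assumptions.

import torch
import torch.nn.functional as F

def soc_style_loss(logits, target_ids, aux_weight=0.1, margin=1.0, mask=None):
    """
    logits:     (batch, seq, vocab) token scores, read as implicit Q-values Q(s, a)
    target_ids: (batch, seq) expert-demonstrated tokens a*
    mask:       (batch, seq) 1 for supervised positions, 0 for padding
    """
    vocab = logits.size(-1)
    flat_logits = logits.reshape(-1, vocab)
    flat_targets = target_ids.reshape(-1)

    # Standard SFT term: cross-entropy on expert-demonstrated tokens.
    ce = F.cross_entropy(flat_logits, flat_targets, reduction="none")

    # Hypothetical optimism-correction term: keep the expert token's Q-value
    # at least `margin` above the best non-expert token's Q-value, so confidence
    # concentrates on demonstrated steps rather than on insufficiently
    # supervised alternatives.
    expert_q = flat_logits.gather(1, flat_targets.unsqueeze(1)).squeeze(1)
    non_expert = flat_logits.masked_fill(
        F.one_hot(flat_targets, vocab).bool(), float("-inf")
    )
    best_other_q = non_expert.max(dim=-1).values
    aux = F.relu(margin + best_other_q - expert_q)

    loss = ce + aux_weight * aux
    if mask is not None:
        flat_mask = mask.reshape(-1).float()
        return (loss * flat_mask).sum() / flat_mask.sum().clamp(min=1.0)
    return loss.mean()
```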