Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning

Zhaohui Yang, Yuxiao Ye, Shilei Jiang, Shihong Deng, Chen Hu, Linjing Li, Daxin Jiang


Abstract
Recent advances in reasoning language models have brought a paradigm shift from short to long chain-of-thought (CoT) patterns. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of a fixed training dataset becomes crucial. Our analysis reveals that negative responses contain valuable components, such as self-reflection and error-correction steps, yet most existing methods either discard negative samples entirely (RFT) or apply equal penalization across all of their tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework comprising three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA, designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math and coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
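To make stage 3 concrete, the following is a minimal PyTorch sketch of what an NSA-style, behavior-constrained policy-gradient loss could look like. The function name `bcpg_nsa_loss`, the `nsa_weight` hyperparameter, and the clipped importance-ratio form are illustrative assumptions, not the paper's exact objective; consult the PDF for the actual formulation.

```python
import torch

def bcpg_nsa_loss(logprobs, old_logprobs, advantages, step_correct_mask,
                  neg_mask, nsa_weight=0.5, clip_eps=0.2):
    """Hypothetical sketch of an NSA-weighted policy-gradient loss.

    logprobs, old_logprobs: (B, T) token log-probs under the current / behavior policy.
    advantages:             (B, T) per-token advantages (negative for failed rollouts).
    step_correct_mask:      (B, T) 1 where consensus judging (LLM + PRM) marked the
                            enclosing reasoning step correct, else 0.
    neg_mask:               (B, T) 1 for tokens belonging to negative (incorrect) samples.
    nsa_weight:             assumed down-weighting of the penalty on correct steps
                            inside negative samples (a free hyperparameter here).
    """
    # Behavior-constrained importance ratio, clipped as in PPO-style objectives.
    ratio = torch.exp(logprobs - old_logprobs)
    ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Instead of penalizing all tokens of a negative sample equally, shrink the
    # penalty on tokens inside steps judged correct (e.g. valid self-reflection),
    # so these "gems" are not punished as hard as genuinely erroneous steps.
    scale = torch.where(neg_mask.bool() & step_correct_mask.bool(),
                        torch.full_like(advantages, nsa_weight),
                        torch.ones_like(advantages))
    return -(ratio * advantages * scale).mean()
```

Setting `nsa_weight` to 1 would recover uniform penalization of negative samples, while 0 would fully exempt judged-correct steps; intermediate values interpolate between the two, which is the intuition behind mining positive steps from negative rollouts.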
Anthology ID:
2025.findings-emnlp.57
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1061–1075
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.57/
DOI:
10.18653/v1/2025.findings-emnlp.57
Cite (ACL):
Zhaohui Yang, Yuxiao Ye, Shilei Jiang, Shihong Deng, Chen Hu, Linjing Li, and Daxin Jiang. 2025. Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1061–1075, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning (Yang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.57.pdf
Checklist:
2025.findings-emnlp.57.checklist.pdf