Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

Xinyu Tang; Yuliang Zhan; Zhixun Li; Wayne Xin Zhao; Zhenduo Zhang; Zujie Wen; Zhiqiang Zhang; Jun Zhou

Abstract

Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable rewards (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct ***sample polarities***. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the polarity level and the token level affects RLVR training. Based on these insights, we propose an **A**daptive and **A**symmetric token-level **A**dvantage shaping method for **P**olicy **O**ptimization, namely **A3PO**, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.

Anthology ID: 2026.acl-long.134
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2928–2954
URL: https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.134/
Cite (ACL): Xinyu Tang, Yuliang Zhan, Zhixun Li, Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. 2026. Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2928–2954, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards (Tang et al., ACL 2026)
PDF: https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.134.pdf
Checklist: 2026.acl-long.134.checklist.pdf
