Turning Failures into Value: Negative Experience Replay for RLVR via Confidence Gating and Boundary Failure Sampling

Jialiang Guo, Fucheng Xiong, Xu He, Haodong Zhao, Xingyang Li, Ke Zeng, Xunliang Cai


Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for enhancing reasoning capabilities in Large Language Models, yet on-policy algorithms like GRPO suffer from sample inefficiency. Current experience replay methods for RLVR typically replay correct trajectories to consolidate learned reasoning patterns and accelerate convergence, but overlook the vast failure space. This work investigates how to effectively replay failure trajectories. We find that the high heterogeneity of failures renders random replay ineffective, and that high-value negatives should be both gradient-efficient and structurally proximal to correct solutions. To this end, we propose NexGRPO, which employs mid-confidence gating to filter invalid noise and saturated errors, and utilizes boundary failure sampling to retrieve boundary errors semantically similar to correct solutions for targeted refinement. Extensive experiments on mathematical and general reasoning benchmarks demonstrate that NexGRPO outperforms strong baselines and achieves improved out-of-distribution generalization.
Anthology ID:
2026.acl-long.1682
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
36316–36334
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1682/
Cite (ACL):
Jialiang Guo, Fucheng Xiong, Xu He, Haodong Zhao, Xingyang Li, Ke Zeng, and Xunliang Cai. 2026. Turning Failures into Value: Negative Experience Replay for RLVR via Confidence Gating and Boundary Failure Sampling. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36316–36334, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Turning Failures into Value: Negative Experience Replay for RLVR via Confidence Gating and Boundary Failure Sampling (Guo et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1682.pdf
Checklist:
 2026.acl-long.1682.checklist.pdf