Turning Failures into Value: Negative Experience Replay for RLVR via Confidence Gating and Boundary Failure Sampling
Jialiang Guo, Fucheng Xiong, Xu He, Haodong Zhao, Xingyang li, Ke Zeng, Xunliang Cai
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for enhancing reasoning capabilities in Large Language Models, yet on-policy algorithms like GRPO suffer from sample inefficiency. Current experience replay methods for RLVR typically replay correct trajectories to consolidate learned reasoning patterns and accelerate convergence, but overlook the vast failure space. This work investigates how to effectively replay failure trajectories. We find that the high heterogeneity of failures renders random replay ineffective, and that high-value negatives should be both gradient-efficient and structurally proximal to correct solutions. To this end, we propose NexGRPO, which employs mid-confidence gating to filter invalid noise and saturated errors, and utilizes boundary failure sampling to retrieve boundary errors semantically similar to correct solutions for targeted refinement. Extensive experiments on mathematical and general reasoning benchmarks demonstrate that NexGRPO outperforms strong baselines and achieves improved out-of-distribution generalization.
- Anthology ID:
- 2026.acl-long.1682
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 36316–36334
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1682/
- Cite (ACL):
- Jialiang Guo, Fucheng Xiong, Xu He, Haodong Zhao, Xingyang li, Ke Zeng, and Xunliang Cai. 2026. Turning Failures into Value: Negative Experience Replay for RLVR via Confidence Gating and Boundary Failure Sampling. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36316–36334, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Turning Failures into Value: Negative Experience Replay for RLVR via Confidence Gating and Boundary Failure Sampling (Guo et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1682.pdf
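
The abstract names two selection steps for failed trajectories: mid-confidence gating and boundary failure sampling. The following is a minimal sketch of how such a two-stage filter might look, not the authors' implementation. The abstract does not say how confidence or semantic similarity is measured, so everything here is an assumption made for illustration: the `Failure` record, the use of a scalar confidence in [0, 1], the `low`/`high` thresholds, cosine similarity over precomputed embeddings, and the top-`k` cutoff.

```python
import math
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical record for one failed trajectory (not from the paper)."""
    text: str
    confidence: float        # assumed: some scalar policy confidence in [0, 1]
    embedding: list[float]   # assumed: any fixed-size embedding of the trajectory


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_negatives(failures, correct_embeddings, low=0.2, high=0.8, k=2):
    """Two-stage selection of failures to replay (illustrative thresholds).

    Stage 1 (gating): keep only failures whose confidence lies in the
    middle band [low, high], dropping both tails.
    Stage 2 (boundary sampling): rank the survivors by their highest cosine
    similarity to any correct solution and return the top k.
    """
    gated = [f for f in failures if low <= f.confidence <= high]

    def proximity(f):
        return max((cosine(f.embedding, c) for c in correct_embeddings), default=0.0)

    return sorted(gated, key=proximity, reverse=True)[:k]
```

In a replay setup, the returned failures would be the ones added to the training batch alongside fresh on-policy samples; how they are weighted in the GRPO objective is not described in the abstract and is left out of this sketch.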