Xu He
2026
Turning Failures into Value: Negative Experience Replay for RLVR via Confidence Gating and Boundary Failure Sampling
Jialiang Guo | Fucheng Xiong | Xu He | Haodong Zhao | Xingyang li | Ke Zeng | Xunliang Cai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for enhancing reasoning capabilities in Large Language Models, yet on-policy algorithms like GRPO suffer from sample inefficiency. Current experience replay methods for RLVR typically replay correct trajectories to consolidate learned reasoning patterns and accelerate convergence, but overlook the vast failure space. This work investigates how to effectively replay failure trajectories. We find that the high heterogeneity of failures renders random replay ineffective, and that high-value negatives should be both gradient-efficient and structurally proximal to correct solutions. To this end, we propose NexGRPO, which employs mid-confidence gating to filter invalid noise and saturated errors, and utilizes boundary failure sampling to retrieve boundary errors semantically similar to correct solutions for targeted refinement. Extensive experiments on mathematical and general reasoning benchmarks demonstrate that NexGRPO outperforms strong baselines and achieves improved out-of-distribution generalization.
2024
CTYUN-AI at SemEval-2024 Task 7: Boosting Numerical Understanding with Limited Data Through Effective Data Alignment
Yuming Fan | Dongming Yang | Xu He
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Large language models (LLMs) have demonstrated remarkable capabilities in pushing the boundaries of natural language understanding. Nevertheless, the majority of existing open-source LLMs still fall short of satisfactory standards when it comes to addressing numerical problems, especially as the enhancement of their numerical capabilities relies heavily on extensive data. To bridge the gap, we aim to improve the numerical understanding of LLMs by means of efficient data alignment, utilizing only a limited amount of necessary data. Specifically, we first use a data discovery strategy to obtain the most effective portion of numerical data from large datasets. Second, self-augmentation is performed to maximize the potential of the training data. Third, the answers of all training samples are aligned based on some simple rules. Our method achieves first place in the competition, offering new insights and methodologies for numerical understanding research in LLMs.