Binbin Zheng
2026
MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
Xiaoliang Fu | Jiaye Lin | Yangyi Fang | Binbin Zheng | Chaowen Hu | Zekai Shao | Cong Qin | Lu Pan | Ke Zeng | Xunliang Cai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming baselines. Our code is available at: https://github.com/FlyTune/MASPO-RL.
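The contrast between the "binary cutoff of hard clipping" and a soft Gaussian gate can be illustrated with a minimal sketch. The abstract does not give MASPO's actual gating formula, so the Gaussian form, the width `sigma`, and the function names below are assumptions made purely for illustration; only the hard-clip rule follows the standard PPO/GRPO clipped surrogate.

```python
import math


def hard_clip_gate(ratio, advantage, eps=0.2):
    """Gradient gate implied by the standard clipped surrogate.

    Returns 0.0 when the clipped objective passes no gradient for this
    token, and 1.0 otherwise.
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0  # positive sample already pushed past the upper bound
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # negative sample already pushed past the lower bound
    return 1.0


def gaussian_gate(ratio, sigma=0.2):
    """Hypothetical soft gate: decays smoothly as the ratio leaves 1.

    This form is an assumption for illustration, not MASPO's formula.
    """
    return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    for r in (1.0, 1.1, 1.3, 1.5):
        print(r, hard_clip_gate(r, advantage=1.0), round(gaussian_gate(r), 4))
```

Under these assumptions, a token with ratio 1.5 and positive advantage contributes nothing under hard clipping but keeps a small, nonzero weight under the soft gate.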
From log 𝜋 to 𝜋: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight
Xiaoliang Fu | Jiaye Lin | Yangyi Fang | Chaowen Hu | Cong Qin | Zekai Shao | Binbin Zheng | Lu Pan | Ke Zeng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on the log-probability gradient (∇𝜃 log 𝜋𝜃) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing the probability gradient (∇𝜃 𝜋𝜃) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek-R1-Distill-Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/FlyTune/DGPO-RL.
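The divergence mentioned above follows from the chain-rule identity ∇𝜃 log 𝜋𝜃 = (1/𝜋𝜃) ∇𝜃 𝜋𝜃: written in terms of the probability gradient, the log-probability gradient carries a weight of 1/𝜋 that grows without bound as 𝜋 approaches 0. The sketch below checks this for a plain softmax over logits. It illustrates only the general identity, not DGPO's decay mechanism, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_prob_own_logit(logits, i):
    """d pi_i / d z_i = pi_i * (1 - pi_i) for a softmax policy."""
    p = softmax(logits)[i]
    return p * (1.0 - p)


def grad_logprob_own_logit(logits, i):
    """d log pi_i / d z_i = 1 - pi_i for a softmax policy."""
    p = softmax(logits)[i]
    return 1.0 - p


def implied_weight(logits, i):
    """Ratio of the two gradients, which equals 1 / pi_i."""
    return grad_logprob_own_logit(logits, i) / grad_prob_own_logit(logits, i)


if __name__ == "__main__":
    # As the competing logit grows, pi_0 shrinks and the weight blows up.
    for other in (2.0, 5.0, 10.0):
        logits = [0.0, other]
        print(softmax(logits)[0], implied_weight(logits, 0))
```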
Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning
Naixin Zhai | Pengyang Shao | Binbin Zheng | Yonghui Yang | Fei Shen | Long Bai | Xun Yang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Machine unlearning aims to forget sensitive knowledge from Large Language Models (LLMs) while maintaining general utility. However, existing approaches typically treat all tokens in a response indiscriminately and enforce uncertainty over the entire vocabulary. This global treatment results in unnecessary utility degradation and extends optimization to content-agnostic regions. To address these limitations, we propose PALU (Prefix-Aware Localized Unlearning), a framework driven by a local entropy maximization objective across both temporal and vocabulary dimensions. PALU reveals that (i) suppressing the sensitive prefix alone is sufficient to sever the causal generation link, and (ii) flattening only the top-K logits is adequate to maximize uncertainty in the critical subspace. These findings allow PALU to alleviate redundant optimization across the full vocabulary and parameter space while minimizing collateral damage to general model performance. Comprehensive evaluations validate that PALU achieves superior forgetting efficacy and utility preservation compared to state-of-the-art baselines. Our code is available at https://github.com/nxZhai/PALU.
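Finding (ii) above can be made concrete with a small sketch: if entropy is measured only over the top-K logits, making those K logits equal attains the maximum possible local entropy, log K, while leaving the rest of the vocabulary untouched. PALU itself is a training objective over model parameters, which this does not reproduce; the helper names and the choice of replacing the top-K logits with their mean are assumptions for illustration.

```python
import math


def top_k_indices(logits, k):
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def local_entropy(logits, k):
    """Entropy of the softmax restricted to the top-k logits."""
    sub = [logits[i] for i in top_k_indices(logits, k)]
    m = max(sub)
    exps = [math.exp(z - m) for z in sub]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flatten_top_k(logits, k):
    """Replace the top-k logits by their mean; other logits are unchanged."""
    idx = top_k_indices(logits, k)
    mean = sum(logits[i] for i in idx) / k
    out = list(logits)
    for i in idx:
        out[i] = mean
    return out


if __name__ == "__main__":
    logits = [5.0, 2.0, 1.0, -3.0]
    print(local_entropy(logits, 3))                    # below log(3)
    print(local_entropy(flatten_top_k(logits, 3), 3))  # approximately log(3)
```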