Mingze Wang
2026
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Guanhua Huang | Tingqiang Xu | Mingze Wang | Qi Yi | Xue Gong | Siheng Li | Ruibin Xiong | Kejiao Li | Yuhao Jiang | Bo Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. While previous methods attempt to maintain high entropy, we argue that unselective entropy maximization risks amplifying irrelevant noise rather than fostering meaningful exploration. In this paper, we identify a deeper issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks, driven by RLVR over-penalization. To address this, we introduce Low-probability Regularization (Lp-Reg). Leveraging the statistical distinction where reasoning sparks exhibit higher probabilities than noise, Lp-Reg filters out the extremely low-probability noise tokens and prevents the suppression of potentially valuable low-probability candidates. Experiments demonstrate that Lp-Reg enables stable on-policy training for over 3,000 steps (81,204 GPU-hours), sustaining exploration in regimes where baselines typically collapse. Validated across extensive evaluations totaling over 300,000 cumulative GPU-hours, Lp-Reg demonstrates highly competitive performance in off-policy settings and consistently achieves state-of-the-art results in on-policy training across diverse model families, sizes, and domains, with relative accuracy improvements ranging from 3.06% to 7.98%.
2025
Tunable LLM-based Proactive Recommendation Agent
Mingze Wang | Chongming Gao | Wenjie Wang | Yangyang Li | Fuli Feng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recommender systems are indispensable on various digital platforms. However, traditional methods often reinforce existing user interests, which leads to echo chambers and limits diversity. Proactive Recommendation Systems (PRS) aim to address this issue by cultivating users’ latent interests through multi-step recommendations. Despite advancements, challenges persist, particularly in optimizing long-term rewards and adapting to real-time user feedback. In this study, we propose an LLM-based Actor-Critic Agent framework to enhance PRS. This framework uses an LLM-based agent to adjust recommendations in real time based on feedback and employs agent-tuning methods to optimize long-term rewards using three proposed reward functions. Extensive experiments show that this framework significantly outperforms existing methods, both in optimizing long-term rewards and in adapting dynamically to user feedback.
2024
Are AI-Generated Text Detectors Robust to Adversarial Perturbations?
Guanhua Huang | Yuchen Zhang | Zhe Li | Yongjian You | Mingze Wang | Zhouwang Yang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations: even minor changes to characters or words can flip a detector's decision between human-created and AI-generated text. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique to train the model to make equally confident predictions under different noise, which improves the model’s robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving an absolute accuracy improvement of 6.5% to 18.25% over the best baseline method under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at https://github.com/CarlanLark/Robust-AIGC-Detector.