Yang Liu

Other people with similar names: Yang Janet Liu (Georgetown University; 刘洋), Yang Liu (Tsinghua), Yang Liu (Fudan), Yang Liu (BIGAI), Yang Liu, Yang Liu (Hunan), Yang Liu (3M Health Information Systems), Yang Liu, Yang Liu, Yang Liu (UC Santa Cruz), Yang Liu (South China University of Technology), Yang Liu, Yang Liu, Yang Liu (NTU), Yang Liu (Sun Yat-sen University), Yang Liu (North Carolina Central University), Yang Liu (Beijing Language and Culture University), Yang Liu (National University of Defense Technology), Yang Liu (Edinburgh Ph.D., Microsoft), Yang Liu (University of Helsinki), Yang Liu (The Chinese University of Hong Kong (Shenzhen)), Yang Liu (刘扬; Ph.D. Purdue; ICSI, Dallas, Facebook, Liulishuo, Amazon), Yang Liu (刘洋; ICT, Tsinghua, Beijing Academy of Artificial Intelligence), Yang Liu (Microsoft Cognitive Services Research), Yang Liu (刘扬) (Peking University), Yang Liu (Samsung Research Center Beijing), Yang Liu (Tianjin University, China), Yang Liu (Univ. of Michigan, UC Santa Cruz), Yang Liu (Wilfrid Laurier University)

Unverified author pages with similar names: Yang Liu


2026

Reinforcement learning (RL) is effective for improving code generation but suffers from data scarcity. While experience replay mitigates this, existing approaches rely on static, in-epoch metrics that overlook training dynamics, often introducing low-utility or outdated data. Analyzing RL dynamics via dataset cartography, we observe that “ambiguous” samples, which are vital for model generalization, rapidly migrate to “easy-to-learn” regions, diminishing their training value. To address this, we propose Adaptive Ambiguity Replay (A2R) for RL, a plug-and-play module that prioritizes cross-epoch ambiguous samples. To neutralize the noise from stale experiences, A2R incorporates an adaptive importance mechanism based on policy divergence to weigh replayed rollouts. Extensive experiments on nine LLMs (3B–14B) demonstrate that A2R outperforms state-of-the-art baselines on real-world code editing tasks across both unseen and learned domains. Our results highlight cross-epoch ambiguity as a key factor for effective replay in RL. Code: https://github.com/TsingZ0/verl-A2R
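The two ideas in this abstract, selecting samples that stay ambiguous across epochs and down-weighting stale replayed rollouts by policy divergence, can be illustrated with a minimal sketch. This is not the A2R implementation from the linked repository. The function names (`cartography_stats`, `select_ambiguous`, `replay_weight`), the use of per-epoch pass rates as the cartography signal, and the exponential weighting rule are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def cartography_stats(pass_rates):
    """Return (confidence, variability) for one prompt.

    pass_rates: hypothetical per-epoch success rates of the policy on
    this prompt. Following dataset cartography, confidence is the mean
    across epochs and variability is the standard deviation.
    """
    return mean(pass_rates), pstdev(pass_rates)


def select_ambiguous(history, k):
    """Pick the k prompts with the highest cross-epoch variability.

    history: dict mapping prompt id -> list of per-epoch pass rates.
    High variability is the usual cartography marker of an
    "ambiguous" sample.
    """
    ranked = sorted(
        history,
        key=lambda pid: cartography_stats(history[pid])[1],
        reverse=True,
    )
    return ranked[:k]


def replay_weight(old_logprobs, new_logprobs, temperature=1.0):
    """Illustrative staleness weight for one replayed rollout.

    Assumed rule (not taken from the paper): the weight decays
    exponentially with the absolute mean per-token log-probability gap
    between the current policy and the policy that produced the rollout.
    """
    gaps = [new - old for old, new in zip(old_logprobs, new_logprobs)]
    return math.exp(-abs(mean(gaps)) / temperature)


history = {
    "a": [0.0, 0.5, 1.0],  # swings across epochs: ambiguous
    "b": [1.0, 1.0, 1.0],  # always solved: easy-to-learn
    "c": [0.4, 0.5, 0.6],  # mildly variable
}
chosen = select_ambiguous(history, k=1)
fresh = replay_weight([-1.0, -2.0], [-1.0, -2.0])
stale = replay_weight([-1.0, -2.0], [-2.0, -3.0])
```

In this toy data, prompt `a` is selected because its pass rate varies most across epochs, and a rollout whose log-probabilities are unchanged under the current policy keeps full weight while a drifted one is discounted.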