Yuanshun Yao

2026

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, where models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are calculated on the current task, thus they do not guarantee broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP—a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts in an online manner using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks based on Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.

2025

pdf bib abs

Alignment of large language models (LLM) is a process that ensures the model’s responses to user prompts align with human intentions and social values. This optimization typically relies on pre-collected prompts. The collection of these prompts often either requires careful human interventions or proves to be difficult to have a good coverage over all scenarios an LLM can improve over . To address this issue, we propose an alignment method based on a two-agent game, consisting of an adversarial agent and a defensive agent. The adversarial agent’s task is to generate prompts that expose the deficiencies of the defensive agent. At the same time, the defensive agent improves its performance on the prompts generated by the adversary based on feedback from the reward model. This iterative process is repeated to enhance the model’s performance. We theoretically demonstrate that, under mild assumptions, this iterative alignment process converges to a Nash equilibrium by both agents. Learning in this competitive environment results in policies with better generalization capabilities. We demonstrate the advantage of our framework using extensive experiments.