Weizheng Gu
2026
Joint Optimization of Training Data and Policy in RLHF
Zhuohao Yu | Jiali Zeng | Weizheng Gu | Mengyuan Sun | Yidong Wang | Fandong Meng | Jie Zhou | Shikun Zhang | Wei Ye
Findings of the Association for Computational Linguistics: ACL 2026
Traditional reinforcement learning from human feedback (RLHF) optimizes policies on fixed training inputs, limiting the diversity of learning signals. We propose JODP (Joint Optimization of Data and Policy), a framework in which the evolving policy model generates improved variants of training problems to enhance its own learning. While the training problems themselves remain fixed, JODP optimizes how they are presented: the policy generates specification hints that guide rollout generation, then learns to reproduce the discovered high-reward behaviors without the hints. This "if you can solve it with a hint, learn to solve it without one" principle creates a co-evolutionary dynamic in which better policies discover better specifications, which in turn enable further policy improvement. JODP operates as a plug-and-play enhancement to existing algorithms: specifications are selected via UCB bandits to balance exploration and exploitation, used only during training rollouts, and discarded at deployment. In evaluations on safety alignment tasks, we demonstrate consistent improvements with GRPO, RLOO, and REINFORCE++, allowing 4B models to approach 8B model performance with less than 1% additional computational overhead.
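The abstract says specifications are selected with UCB bandits but does not give the exact rule. As a rough illustration only, the sketch below uses the textbook UCB1 rule over a fixed pool of candidate hints. The class name `UCB1Selector`, the `simulate` helper, the exploration constant, and the assumption that each rollout yields a scalar reward in [0, 1] are all hypothetical and are not taken from the paper.

```python
import math
import random


class UCB1Selector:
    """Textbook UCB1 over a fixed pool of candidate specification hints.

    Hypothetical sketch: the paper's actual bandit variant may differ.
    """

    def __init__(self, num_arms, exploration=2.0):
        self.counts = [0] * num_arms   # times each hint was used
        self.means = [0.0] * num_arms  # running mean reward per hint
        self.exploration = exploration

    def select(self):
        # Try every hint once before applying the UCB rule.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        # Ties resolve to the lowest index.
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update with the observed rollout reward.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(true_means, steps=2000, seed=0):
    """Toy loop: Bernoulli rewards stand in for rollout rewards (assumption)."""
    rng = random.Random(seed)
    selector = UCB1Selector(len(true_means))
    for _ in range(steps):
        arm = selector.select()
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        selector.update(arm, reward)
    return selector
```

In a training loop of this shape, `select` would pick the hint attached to a prompt before rollouts are sampled, and `update` would receive the resulting reward. Consistent with the abstract, the selector would be used only during training and dropped at deployment.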