Weizheng Gu


2026

Traditional reinforcement learning from human feedback (RLHF) optimizes policies on fixed training inputs, which limits the diversity of learning signals. We propose JODP (Joint Optimization of Data and Policy), a framework in which the evolving policy model generates improved variants of training problems to enhance its own learning. The underlying training problems remain fixed; JODP optimizes how they are presented. The policy generates specification hints that guide rollout generation, then learns to reproduce the discovered high-reward behaviors without the hints. This "if you can solve it with a hint, learn to solve it without one" principle creates a co-evolutionary dynamic in which better policies discover better specifications, which in turn enable further policy improvement. JODP operates as a plug-and-play enhancement to existing algorithms: specifications are selected via UCB bandits to balance exploration and exploitation, used only during training rollouts, and discarded at deployment. In evaluations on safety alignment tasks, JODP yields consistent improvements when combined with GRPO, RLOO, and REINFORCE++, allowing 4B models to approach 8B model performance with less than 1% additional computational overhead.
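
As a rough illustration of the specification-selection step, the sketch below treats each candidate specification hint as a bandit arm and picks one with the standard UCB1 rule. The abstract does not state which UCB variant JODP uses, so UCB1 is an assumption, as are the class name `UCBSpecSelector`, the exploration constant `c`, and the choice of a scalar reward (for example, the mean rollout reward obtained with the hint) as the bandit feedback.

```python
import math


class UCBSpecSelector:
    """UCB1 bandit over a fixed pool of candidate specification hints.

    Hypothetical helper for illustration; not the authors' implementation.
    """

    def __init__(self, specs, c=1.0):
        self.specs = list(specs)              # candidate hints (one arm each)
        self.c = c                            # exploration constant (assumed)
        self.counts = [0] * len(self.specs)   # times each hint was used
        self.means = [0.0] * len(self.specs)  # running mean reward per hint
        self.total = 0                        # total number of selections

    def select(self):
        """Return the index of the hint to use for the next rollout batch."""
        # Use every hint once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        # Mean reward plus an exploration bonus that shrinks with usage.
        scores = [
            m + self.c * math.sqrt(2.0 * math.log(self.total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        """Record the reward observed for rollouts generated with hint `arm`."""
        self.counts[arm] += 1
        self.total += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In a training loop, one would call `select()` before generating rollouts, prepend `specs[arm]` to the prompt for those rollouts only, and call `update(arm, reward)` afterwards; the policy update itself would then be computed against the prompt without the hint, matching the "learn to solve it without one" principle.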