Joint Optimization of Training Data and Policy in RLHF
Zhuohao Yu, Jiali Zeng, Weizheng Gu, Mengyuan Sun, Yidong Wang, Fandong Meng, Jie Zhou, Shikun Zhang, Wei Ye
Abstract
Traditional reinforcement learning from human feedback (RLHF) optimizes policies on fixed training inputs, limiting the diversity of learning signals. We propose JODP (Joint Optimization of Data and Policy), a framework where the evolving policy model generates improved variants of training problems to enhance its own learning. While training problems remain fixed, JODP optimizes how they are presented: the policy generates specification hints that guide rollout generation, then learns to reproduce the discovered high-reward behaviors without the hints. This "if you can solve it with a hint, learn to solve it without one" principle creates a co-evolutionary dynamic where better policies discover better specifications, which enable further policy improvement. JODP operates as a plug-and-play enhancement to existing algorithms: specifications are selected via UCB bandits for exploration-exploitation balance, used only during training rollouts, and discarded at deployment. Through evaluation on safety alignment tasks, we demonstrate consistent improvements with GRPO, RLOO, and REINFORCE++, allowing 4B models to approach 8B model performance using less than 1% additional computational overhead.
- Anthology ID:
- 2026.findings-acl.2109
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 42498–42514
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2109/
- Cite (ACL):
- Zhuohao Yu, Jiali Zeng, Weizheng Gu, Mengyuan Sun, Yidong Wang, Fandong Meng, Jie Zhou, Shikun Zhang, and Wei Ye. 2026. Joint Optimization of Training Data and Policy in RLHF. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42498–42514, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Joint Optimization of Training Data and Policy in RLHF (Yu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2109.pdf
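
Illustrative sketch (not from the paper): the abstract states that specifications are selected via UCB bandits, used only during training rollouts, and discarded at deployment. The Python below is one minimal way such a selection step might look, using the textbook UCB1 rule. The class `UCBSpecSelector`, the helper `build_rollout_prompt`, the exploration constant `c`, and the hint format are all hypothetical names and choices made for illustration; the paper's actual bandit formulation, reward definition, and prompt layout are not given on this page.

```python
import math


class UCBSpecSelector:
    """UCB1 over a pool of candidate specification hints (hypothetical helper)."""

    def __init__(self, specs, c=1.0):
        self.specs = list(specs)
        self.c = c                            # exploration strength (assumed value)
        self.counts = [0] * len(self.specs)   # times each hint was used
        self.means = [0.0] * len(self.specs)  # running mean reward per hint
        self.total = 0                        # total number of updates

    def select(self):
        # Use every hint once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        # UCB1 score: mean reward plus an exploration bonus that shrinks
        # as a hint is used more often.
        scores = [
            m + self.c * math.sqrt(2.0 * math.log(self.total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        # Incremental update of the running mean for hint i.
        self.counts[i] += 1
        self.total += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


def build_rollout_prompt(problem, spec):
    # Assumed format: the hint is appended only for training rollouts.
    # The learning target would then be conditioned on `problem` alone,
    # so the hint is not needed at deployment.
    return f"{problem}\n\nHint: {spec}"
```

In a training loop under these assumptions, each step would call `select()`, generate rollouts from `build_rollout_prompt(problem, specs[i])`, score them with the reward model, call `update(i, reward)`, and then compute the policy-gradient loss (GRPO, RLOO, or REINFORCE++) against the hint-free prompt.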