IRPO: Implicit Policy Regularized Preference Optimization
Youngsoo Jang, Yu Jin Kim, Geon-Hyeong Kim, Honglak Lee, Moontae Lee
Abstract
Training complexity often scales with the size of the hyperparameter space for Large Language Models (LLMs). While Direct Preference Optimization (DPO) offers learning stability by reparameterizing the reward function, its regularization against the reference policy can lead to suboptimal outcomes when the reference policy is itself not optimal. Recent DPO variants address this concern, but at a cost: they introduce additional hyperparameters, reducing their feasibility for LLM fine-tuning. To overcome this challenge, we introduce Implicit policy Regularized Preference Optimization (IRPO), which tackles suboptimality while maintaining training simplicity. By treating the winning policy that generated the chosen responses in a pairwise dataset as an implicit policy, IRPO maximizes a KL-regularized reward without extra hyperparameters. We then propose a novel preference optimization algorithm that directly optimizes the IRPO objective by estimating the likelihood ratio between implicit policies. As the winning policy generally outperforms the reference policy, IRPO can effectively address this suboptimality. Our experiments show that IRPO significantly outperforms baseline algorithms with the same hyperparameter complexity. Moreover, IRPO matches the performance of recent algorithms that rely on a larger number of hyperparameters, offering a practical solution for scalable LLM fine-tuning.
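For context, the sketch below illustrates the kind of objective the abstract describes: the standard DPO loss, which regularizes toward the reference policy, and a hypothetical IRPO-style variant in which the implicit winning policy takes the reference policy's place as the regularization target. The function names, the `beta` default, and the `winning_*_logps` inputs are illustrative assumptions; the paper's actual estimator of the likelihood ratio between implicit policies is not reproduced here.

```python
# Minimal sketch of a DPO loss and a hypothetical IRPO-style variant.
# Inputs are sequence-level log-probabilities of chosen/rejected responses.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Standard DPO: KL-regularizes the learned policy toward the
    # (possibly suboptimal) reference policy via implicit log-ratios.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()


def irpo_style_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    winning_chosen_logps: torch.Tensor,
                    winning_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # Hypothetical IRPO-style objective: the implicit winning policy
    # (the one that generated the chosen responses) replaces the
    # reference policy as the regularization target. How the
    # winning-policy log-probabilities are estimated is the paper's
    # contribution and is NOT shown here; they are placeholders.
    chosen_logratios = policy_chosen_logps - winning_chosen_logps
    rejected_logratios = policy_rejected_logps - winning_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()


# Toy usage on random log-probabilities for a batch of 8 preference pairs.
if __name__ == "__main__":
    logps = [torch.randn(8) for _ in range(4)]
    print("DPO loss:       ", dpo_loss(*logps).item())
    print("IRPO-style loss:", irpo_style_loss(*logps).item())
```

Note that the variant keeps DPO's single `beta`, consistent with the abstract's claim that IRPO introduces no hyperparameters beyond DPO's.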
- Anthology ID: 2026.findings-eacl.281
- Volume: Findings of the Association for Computational Linguistics: EACL 2026
- Month: March
- Year: 2026
- Address: Rabat, Morocco
- Editors: Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5304–5325
- URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.281/
- Cite (ACL): Youngsoo Jang, Yu Jin Kim, Geon-Hyeong Kim, Honglak Lee, and Moontae Lee. 2026. IRPO: Implicit Policy Regularized Preference Optimization. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5304–5325, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal): IRPO: Implicit Policy Regularized Preference Optimization (Jang et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.281.pdf