SPO: Self Preference Optimization with Self Regularization

Yuhao Sun, Yifan Zhang, Quandong Wang, Qinzhuo Wu, Wei Liu, Jian Luan


Abstract
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that simplifies reinforcement learning from human feedback and improves training stability by reparameterizing the reward function used in PPO. Recently, SimPO (Simple Preference Optimization) and CPO (Contrastive Preference Optimization) have been proposed as reference-free preference optimization methods that further simplify DPO’s training process. We observe that these reference-free methods are more efficient to train but are prone to overoptimization, which degrades performance. To address this issue, we propose Self Preference Optimization (SPO). SPO replaces the conventional logsigmoid loss with the SiLU function. Because the SiLU function attains its minimum at a finite value, it prevents the model from excessively amplifying the probability ratio between chosen and rejected samples, thereby mitigating the overoptimization problem. We theoretically show that the SPO loss is an upper bound of the DPO loss, so optimizing the SPO objective implicitly optimizes the DPO objective. We evaluate SPO on multiple benchmarks, including AlpacaEval 2 and MT-Bench. Experimental results show that SPO achieves a 7% improvement over SimPO in length-controlled win rate on AlpacaEval 2, while also demonstrating superior performance on MT-Bench.
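
The abstract describes SPO as replacing the logsigmoid term of the DPO-style objective with a SiLU-shaped loss over the chosen-vs-rejected reward margin. The snippet below is a minimal sketch of that shape difference only; the exact SPO objective (including how the margin is computed and scaled) is not specified on this page, so the margin definition and both function names are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch (assumption): contrast a DPO-style logsigmoid loss with a
# SiLU-shaped loss applied to the negative reward margin, as suggested by the
# abstract. The actual SPO objective, scaling, and margin definition may differ.
import torch
import torch.nn.functional as F

def logsigmoid_loss(margin: torch.Tensor) -> torch.Tensor:
    """DPO/SimPO-style term: -log sigmoid(margin).

    Monotonically decreasing in the margin, so training keeps pushing the
    chosen/rejected probability ratio higher without bound.
    """
    return -F.logsigmoid(margin)

def silu_loss(margin: torch.Tensor) -> torch.Tensor:
    """Illustrative SPO-style term: SiLU(-margin) = -margin * sigmoid(-margin).

    Reaches its minimum at a finite margin (about 1.28) and flattens toward 0
    beyond it, so ever-larger margins are no longer rewarded, which is the
    over-optimization safeguard the abstract describes.
    """
    return F.silu(-margin)

# margin = implicit reward of the chosen response minus that of the rejected one
margins = torch.linspace(-3.0, 6.0, steps=10)
print(logsigmoid_loss(margins))  # keeps decreasing as the margin grows
print(silu_loss(margins))        # bottoms out near margin ≈ 1.28, then returns toward 0
```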
Anthology ID: 2025.findings-emnlp.300
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5601–5614
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.300/
DOI: 10.18653/v1/2025.findings-emnlp.300
Cite (ACL): Yuhao Sun, Yifan Zhang, Quandong Wang, Qinzhuo Wu, Wei Liu, and Jian Luan. 2025. SPO: Self Preference Optimization with Self Regularization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5601–5614, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): SPO: Self Preference Optimization with Self Regularization (Sun et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.300.pdf
Checklist: 2025.findings-emnlp.300.checklist.pdf