Selective Preference Optimization via Token-Level Reward Function Estimation

Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, Sophia Ananiadou


Abstract
Recent advancements in LLM alignment leverage token-level supervision to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key-token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection without requiring strong, fine-grained supervision signals. We theoretically prove the feasibility of Direct Preference Optimization (DPO) as a token-level reward function estimator, which applies to any existing alignment dataset and enables cost-efficient token selection with small-scale models and limited training data. We then train an oracle model with DPO on the target data and use the estimated reward function to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by optimizing on only 30% of tokens, with up to a 60% reduction in GPU training hours. We also explore SePO as a new paradigm for weak-to-strong generalization, showing that weak oracle models effectively supervise strong policy models with up to 16.8× more parameters. SePO also selects useful supervision signals from out-of-distribution data, alleviating the over-optimization problem.
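As a rough illustration of the pipeline described in the abstract, the sketch below scores response tokens with the log-probability ratio between a DPO-trained oracle model and its reference model, then keeps 30% of tokens as key tokens. The function name, the beta coefficient, and the selection rule (highest-reward tokens for chosen responses, lowest for rejected ones) are illustrative assumptions based on this summary, not the authors' released implementation.

```python
# Minimal sketch of SePO-style token scoring and key-token selection.
# Assumes per-token log-probabilities from the DPO-trained oracle and its
# reference model are already computed; all names and constants are hypothetical.
import torch


def select_key_tokens(oracle_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      is_chosen: bool,
                      keep_ratio: float = 0.3,
                      beta: float = 0.1) -> torch.Tensor:
    """Return a boolean mask over response tokens marking the selected key tokens.

    oracle_logprobs, ref_logprobs: shape (seq_len,), log p(y_t | x, y_<t).
    is_chosen: True for the preferred response, False for the dispreferred one.
    """
    # Token-level reward implied by the DPO-trained oracle: a scaled
    # log-probability ratio against the reference policy.
    token_rewards = beta * (oracle_logprobs - ref_logprobs)

    k = max(1, int(keep_ratio * token_rewards.numel()))
    # One plausible reading of "only the key tokens are selected": keep the
    # highest-reward tokens of the chosen response and the lowest-reward
    # tokens of the rejected response; the paper's exact criterion may differ.
    scores = token_rewards if is_chosen else -token_rewards
    key_indices = torch.topk(scores, k).indices

    mask = torch.zeros_like(token_rewards, dtype=torch.bool)
    mask[key_indices] = True
    return mask


# Toy usage: 10 response tokens with placeholder log-probabilities,
# keeping the top 30% (3 tokens) of the chosen response.
oracle_lp = torch.log(torch.rand(10))
ref_lp = torch.log(torch.rand(10))
print(select_key_tokens(oracle_lp, ref_lp, is_chosen=True))
```

The resulting mask would restrict the contrastive training objective to the selected tokens only, which is where the reported compute savings would come from.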
Anthology ID:
2025.emnlp-main.359
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7043–7067
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.359/
Cite (ACL):
Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, and Sophia Ananiadou. 2025. Selective Preference Optimization via Token-Level Reward Function Estimation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7043–7067, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Selective Preference Optimization via Token-Level Reward Function Estimation (Yang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.359.pdf
Checklist:
2025.emnlp-main.359.checklist.pdf