Expectation Preference Optimization: Reliable Preference Estimation for Improving the Reasoning Capability of Large Language Models

Zelin Li, Dawei Song


Abstract
Pairwise preference optimization, such as Direct Preference Optimization (DPO), was originally designed to align large language models (LLMs) with human values. It has recently been used to improve the supervised fine-tuning (SFT) performance of LLMs. Using pairs of single samples, DPO estimates the probability distribution of the preference for one response over another. However, in tasks with more complicated preferences than human value alignment (e.g., reasoning tasks), this single-sample scheme is likely to introduce deviations from the ground-truth distribution. Correcting such deviations typically requires extra effort, such as external annotations or modifications to the loss function. In this paper, we hypothesise that preferences can be estimated more reliably through a multi-sampling process. Accordingly, we propose an Expectation Preference Optimization (EPO) algorithm that takes pairs of sample groups, instead of pairs of single samples as in DPO, for preference learning. Compared to pairwise DPO, the proposed EPO tends to produce more reliable preference estimates. Applying different preference optimization methods in a self-training paradigm, we have conducted extensive experiments on various reasoning benchmarks. The results show that our EPO approach outperforms a range of baseline approaches in terms of zero-shot accuracy on all benchmarks.
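To make the contrast concrete, the sketch below compares a standard pairwise DPO loss with a group-based variant that averages log-ratios over a group of samples on each side before computing the preference margin. This is a minimal illustration of the abstract's idea, not the paper's exact EPO formulation; the function names, the simple mean aggregation, and the `beta` value are assumptions.

```python
import math

def dpo_loss(logratio_w, logratio_l, beta=0.1):
    # Standard pairwise DPO loss for one (chosen, rejected) pair:
    #   -log sigmoid(beta * (logratio_w - logratio_l)),
    # where each log-ratio is log pi_theta(y|x) - log pi_ref(y|x).
    margin = beta * (logratio_w - logratio_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def group_preference_loss(group_w, group_l, beta=0.1):
    # Illustrative group-based variant: estimate the preference margin
    # from the *expected* (mean) log-ratio over a group of samples on
    # each side, rather than from a single sample per side.
    # (Hypothetical sketch; the paper's EPO objective may differ.)
    mean_w = sum(group_w) / len(group_w)
    mean_l = sum(group_l) / len(group_l)
    margin = beta * (mean_w - mean_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Intuitively, a single outlier sample can flip the sign of the pairwise margin, while the group mean smooths such noise, which is the sense in which group-wise estimation is more reliable.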
Anthology ID:
2025.emnlp-main.1532
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
30119–30134
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1532/
Cite (ACL):
Zelin Li and Dawei Song. 2025. Expectation Preference Optimization: Reliable Preference Estimation for Improving the Reasoning Capability of Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30119–30134, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Expectation Preference Optimization: Reliable Preference Estimation for Improving the Reasoning Capability of Large Language Models (Li & Song, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1532.pdf
Checklist:
 2025.emnlp-main.1532.checklist.pdf