Offline Preference Optimization via Maximum Marginal Likelihood Estimation

Saeed Najafi; Alona Fyshe

Abstract

Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our new MML-based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter than alternative baselines, and 2) achieves competitive or superior preference alignment while better preserving the base model’s general language capabilities. Through a series of ablation experiments, we show that this improved performance is indeed attributable to MMPO’s implicit preference optimization within the gradient updates.

Anthology ID: 2026.eacl-long.318
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 6751–6764
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.318/
Cite (ACL): Saeed Najafi and Alona Fyshe. 2026. Offline Preference Optimization via Maximum Marginal Likelihood Estimation. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6751–6764, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Offline Preference Optimization via Maximum Marginal Likelihood Estimation (Najafi & Fyshe, EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.318.pdf