Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Zeguan Xiao, Yun Chen, Jian Yang, Guanhua Chen, Ke Tang


Abstract
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation that we identify as the “reward-generation gap”: a discrepancy between training objectives and autoregressive decoding dynamics. In this paper, we posit that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during LLM generation and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective on DAAs to analyze their limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both the preferred and the dispreferred response to the length of the shorter one. We conduct experiments with DPO and SimPO, two representative DAAs, and show that POET improves over their standard implementations, with gains of up to 11.8 points on AlpacaEval 2 and overall improvements across downstream tasks. These results underscore the need to mitigate the reward-generation gap in DAAs by better aligning training objectives with autoregressive decoding dynamics.
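The truncation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name poet_truncate and the token IDs are hypothetical, and the sketch assumes each response is already tokenized into a list of IDs that excludes the prompt.

```python
from typing import List, Tuple


def poet_truncate(
    chosen: List[int], rejected: List[int]
) -> Tuple[List[int], List[int]]:
    """Cut both responses to the length of the shorter one.

    Hypothetical helper: assumes `chosen` and `rejected` hold
    response token IDs only (no prompt tokens).
    """
    k = min(len(chosen), len(rejected))
    return chosen[:k], rejected[:k]


# Made-up token IDs, for illustration only.
chosen_ids = [11, 12, 13, 14, 15]
rejected_ids = [21, 22, 23]

chosen_trunc, rejected_trunc = poet_truncate(chosen_ids, rejected_ids)
print(len(chosen_trunc), len(rejected_trunc))
```

After truncation both responses have equal length, so a DPO- or SimPO-style loss computed on the pair would only see the shared-length prefixes; how the truncated pairs are fed into training is left out of this sketch.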
Anthology ID:
2026.findings-acl.713
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14534–14548
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.713/
Cite (ACL):
Zeguan Xiao, Yun Chen, Jian Yang, Guanhua Chen, and Ke Tang. 2026. Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms. In Findings of the Association for Computational Linguistics: ACL 2026, pages 14534–14548, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms (Xiao et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.713.pdf
Checklist:
 2026.findings-acl.713.checklist.pdf