Reward Alignment Optimization: A Direct Point-wise Alignment Approach

Zelin Li; Jia Leng; Dawei Song; Yangen Hu

Abstract

Direct Alignment Algorithms (DAAs) such as DPO simplify RLHF by optimizing policies directly from preference pairs. However, the Bradley–Terry probability-gap objective can induce likelihood displacement and, under weak KL constraints, may even reduce the probability of preferred responses, while implicit rewards can be limited in generalization. We propose Reward Alignment Optimization (RAO), a point-wise direct alignment method that uses an explicit reward model to specify exact target generation probabilities and align the policy offline towards them. Our key insight is a theoretical principle we call "prefix consistency", which links the normalization terms of prompts that share a prefix. Leveraging this property, RAO decouples target reward differentials from bias terms, prevents decreasing preferred-response probabilities, and better exploits reward information both within and across prompts. Extensive experiments on multiple base LLMs show that RAO consistently outperforms existing DAAs while enabling controllable target probability distributions.

Anthology ID: 2026.acl-long.2027
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 43770–43784
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2027/
Cite (ACL): Zelin Li, Jia Leng, Dawei Song, and Yangen Hu. 2026. Reward Alignment Optimization: A Direct Point-wise Alignment Approach. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 43770–43784, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Reward Alignment Optimization: A Direct Point-wise Alignment Approach (Li et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2027.pdf
Checklist: 2026.acl-long.2027.checklist.pdf
