Understanding Conflicts in Multi-Objective Alignment through Reward Consistency

Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, Xiting Wang


Abstract
Multi-objective preference alignment often faces alignment conflicts, where optimizing for one objective (e.g., helpfulness) degrades performance on others (e.g., harmlessness). While prior work focuses on algorithmic solutions, the intrinsic conflict within data and its theoretical impact on training remain underexplored. To bridge this gap, we introduce the principle of Reward Consistency (RC), a theory-grounded criterion that approximates alignment conflicts via reward models. We prove that a sample mitigates conflicts if and only if it satisfies RC, thereby ensuring improvement across all objectives during optimization. Building on this, we propose Reward Consistency Sampling (RCS), an automated framework for constructing pairwise data that adheres to RC, supplemented by a relaxation strategy to enhance flexibility. Extensive experiments show that RCS brings significant and consistent performance gains, achieving an average improvement of 23.07% in both harmlessness and helpfulness during simultaneous optimization compared to the vanilla dataset. Our data-centric approach is complementary to existing alignment algorithms and effective in both sequential and simultaneous optimization scenarios.
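The abstract does not spell out the formal RC criterion, so the following is only a minimal sketch under one assumed reading: a preference pair is treated as reward-consistent when every objective's reward model scores the chosen response above the rejected one. The names `is_reward_consistent` and `filter_consistent`, and the toy reward functions, are hypothetical stand-ins for illustration; they are not the paper's implementation, and the relaxation strategy is not modeled here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; not taken from the paper.
RewardFn = Callable[[str, str], float]   # (prompt, response) -> scalar reward
Pair = Tuple[str, str, str]              # (prompt, chosen, rejected)


def is_reward_consistent(pair: Pair, reward_fns: Dict[str, RewardFn]) -> bool:
    """Assumed reading of RC: every objective prefers chosen over rejected."""
    prompt, chosen, rejected = pair
    return all(fn(prompt, chosen) > fn(prompt, rejected)
               for fn in reward_fns.values())


def filter_consistent(pairs: List[Pair],
                      reward_fns: Dict[str, RewardFn]) -> List[Pair]:
    """Keep only the pairs on which all reward functions agree."""
    return [p for p in pairs if is_reward_consistent(p, reward_fns)]


# Toy stand-ins for learned reward models (assumptions, demonstration only).
def toy_helpfulness(prompt: str, response: str) -> float:
    return float(len(response.split()))              # more words = "more helpful"


def toy_harmlessness(prompt: str, response: str) -> float:
    return -float(response.split().count("unsafe"))  # penalize flagged tokens


REWARDS = {"helpfulness": toy_helpfulness, "harmlessness": toy_harmlessness}

PAIRS = [
    # Both toy rewards prefer the chosen response: kept.
    ("q1", "a detailed careful answer", "unsafe tip"),
    # Helpfulness prefers chosen, harmlessness prefers rejected: dropped.
    ("q2", "long but unsafe unsafe answer text", "brief reply"),
]

KEPT = filter_consistent(PAIRS, REWARDS)
```

In this toy setup only the first pair survives, because the second pair's chosen response is preferred by the helpfulness score but not by the harmlessness score. In practice the toy functions would be replaced by trained reward models, one per objective.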
Anthology ID:
2026.findings-acl.269
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5450–5472
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.269/
Cite (ACL):
Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, and Xiting Wang. 2026. Understanding Conflicts in Multi-Objective Alignment through Reward Consistency. In Findings of the Association for Computational Linguistics: ACL 2026, pages 5450–5472, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Understanding Conflicts in Multi-Objective Alignment through Reward Consistency (Xu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.269.pdf
Checklist:
 2026.findings-acl.269.checklist.pdf