DARM: Distribution-Aware Reward Modeling by Alleviating Biases from Low Preference-Context Dependency Data

Shaofan Liu; Guoqiang Zhang; Shihan Dou; Huiyuan Zheng; Yiming Zhou; Junjie Ye (叶俊杰); Shaowen Wang; Shichun Liu; Jiazheng Zhang; Tao Gui; Qi Zhang; Xuan-Jing Huang (黄萱菁)

Abstract

Reward models (RMs) serve as surrogate objectives in reinforcement learning from human feedback (RLHF), and their scores directly steer policy optimization. We show that standard RM training is vulnerable on data subsets where response quality depends only weakly on the context: such instances encourage the RM to ignore the context, leading to context neglect and degraded accuracy. To address this failure mode, we propose Distribution-Aware Reward Modeling (DARM), which augments the RM objective with a conditional mutual information regularizer that maximizes the mutual information between the context and the predicted reward, conditioned on the response. By explicitly preserving the sensitivity of reward signals to the prompting context, DARM reduces over-reliance on response-only features and improves robustness to contextual variation. Extensive experiments in both in-distribution and out-of-distribution settings show that DARM-trained RMs deliver more accurate and consistent scoring than strong baselines. We further evaluate the downstream impact in RLHF, where DARM produces better-aligned policies. Finally, ablation experiments demonstrate the necessity of each DARM design component and the effect of key parameters on performance.
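
For orientation, the kind of objective the abstract describes can be sketched as follows. The abstract does not state the formula, so everything here is an assumption made for illustration: a Bradley-Terry preference loss as the base RM objective, the symbols x (context), y_w and y_l (preferred and rejected responses), r_θ (predicted reward), and a hypothetical weight λ ≥ 0 on the regularizer. The paper's actual formulation and estimator may differ.

```latex
% Illustrative sketch only; notation and base loss are assumptions, not taken from the paper.
\begin{align}
  \mathcal{L}(\theta)
    &= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
        \Big[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \Big]
       \;-\; \lambda \, I\big(X ;\, R_\theta \mid Y\big), \\
  I\big(X ;\, R_\theta \mid Y\big)
    &= \mathbb{E}_{p(x,\,y,\,r)}
        \left[ \log \frac{p(r \mid x, y)}{p(r \mid y)} \right].
\end{align}
```

Under this reading, the second term is large when the reward distribution changes with the context for a fixed response, so minimizing the loss discourages rewards that depend on the response alone.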

Anthology ID: 2026.acl-long.1839
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 39622–39639
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1839/
Cite (ACL): Shaofan Liu, Guoqiang Zhang, Shihan Dou, Huiyuan Zheng, Yiming Zhou, Junjie Ye, Shaowen Wang, Shichun Liu, Jiazheng Zhang, Tao Gui, Qi Zhang, and Xuanjing Huang. 2026. DARM: Distribution-Aware Reward Modeling by Alleviating Biases from Low Preference-Context Dependency Data. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 39622–39639, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DARM: Distribution-Aware Reward Modeling by Alleviating Biases from Low Preference-Context Dependency Data (Liu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1839.pdf
Checklist: 2026.acl-long.1839.checklist.pdf
