DORM: Preference Data Weights Optimization for Reward Modeling in LLM Alignment
Rongzhi Zhang, Chenwei Zhang, Xinyang Zhang, Liang Qiu, Haoming Jiang, Yuchen Zhuang, Qingru Zhang, Hyokun Yun, Xian Li, Bing Yin, Tuo Zhao, Chao Zhang
Abstract
Aligning large language models (LLMs) with human preferences relies heavily on high-quality reward models. However, existing approaches struggle with two critical challenges: noisy preference labels and the varying importance of preference samples. We introduce DORM, a method that enhances reward modeling by learning to dynamically weigh preference data.DORM initializes data importance using a combination of model uncertainty and prediction disagreement, then iteratively refines them via bilevel optimization to maximize validation performance. Using only 50k samples, DORM trains a 12B reward model that achieves 90.5% accuracy on RewardBench, matching the performance of models trained on significantly larger datasets. Furthermore, downstream alignment tasks show that fine-tuned LLMs with DORM achieve a 61.2% win rate against baseline methods, highlighting its data efficiency and generalizability.- Anthology ID:
- 2025.findings-emnlp.1237
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 22721–22739
- Language:
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1237/
- DOI:
- 10.18653/v1/2025.findings-emnlp.1237
- Cite (ACL):
- Rongzhi Zhang, Chenwei Zhang, Xinyang Zhang, Liang Qiu, Haoming Jiang, Yuchen Zhuang, Qingru Zhang, Hyokun Yun, Xian Li, Bing Yin, Tuo Zhao, and Chao Zhang. 2025. DORM: Preference Data Weights Optimization for Reward Modeling in LLM Alignment. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22721–22739, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- DORM: Preference Data Weights Optimization for Reward Modeling in LLM Alignment (Zhang et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1237.pdf