Abstract
Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors, and reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences that require expert annotation. To address this challenge, we propose the **Do**main knowled**ge** merged **R**eward **M**odel (**DogeRM**), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Our experiments demonstrate that DogeRM enhances performance across different benchmarks, and we provide a detailed analysis showcasing the effects of model merging, highlighting its great potential for facilitating model alignment.
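The abstract describes merging a domain-specific model into a general reward model in parameter space. Below is a minimal sketch of that idea, assuming the merge is a simple weighted average of parameters shared by the two models; the coefficient `alpha`, the helper name `merge_state_dicts`, and the commented usage are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def merge_state_dicts(rm_state, domain_state, alpha=0.5):
    """Weighted average of parameters shared by both models.

    Parameters found only in the reward model (e.g., its reward head)
    are kept unchanged; `alpha` controls how much domain knowledge is
    mixed into the shared backbone weights.
    """
    merged = {}
    for name, rm_param in rm_state.items():
        if name in domain_state and domain_state[name].shape == rm_param.shape:
            merged[name] = (1 - alpha) * rm_param + alpha * domain_state[name]
        else:
            merged[name] = rm_param  # reward-model-specific parameters
    return merged

# Hypothetical usage with two models sharing a backbone architecture:
# rm = AutoModelForSequenceClassification.from_pretrained("<reward-model>")
# domain_lm = AutoModelForCausalLM.from_pretrained("<domain-model>")
# merged = merge_state_dicts(rm.state_dict(), domain_lm.state_dict(), alpha=0.5)
# rm.load_state_dict(merged, strict=False)
```

A simple linear merge like this requires no extra preference data, which is the appeal of the approach: domain knowledge acquired through standard fine-tuning can be folded into the reward model directly in weight space.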
- Anthology ID: 2024.emnlp-main.868
- Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 15506–15524
- URL: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.868/
- DOI: 10.18653/v1/2024.emnlp-main.868
- Cite (ACL): Tzu-Han Lin, Chen-An Li, Hung-yi Lee, and Yun-Nung Chen. 2024. DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15506–15524, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging (Lin et al., EMNLP 2024)
- PDF: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.868.pdf