COPR: Continual Human Preference Learning via Optimal Policy Regularization

Han Zhang, Lin Gui, Yu Lei, Yuanzhao Zhai, Yehong Zhang, Zhuo Zhang, Yulan He, Hui Wang, Yue Yu, Kam-Fai Wong, Bin Liang, Ruifeng Xu


Abstract
Reinforcement Learning from Human Feedback (RLHF) is effective for aligning Large Language Models (LLMs) with human preferences. However, RLHF’s complex process limits its ability to continually learn human feedback, making it impractical for real-world applications where the deployed model continuously receives feedback from users. Non-RL-based methods, such as Direct Preference Optimization (DPO), are not inherently well-suited to Continual Learning (CL). We observe that when combined with Experience Replay (ER) for CL, DPO tends to significantly widen the gap between the probabilities of human-preferred and dispreferred responses. Consequently, this diminishes the diversity of model generations, potentially leading to model collapse. To overcome the above challenges, we propose Continual Optimal Policy Regularization (COPR), a novel non-RL offline method that converts the historical optimal policies into optimization constraints when continually learning new preferences. We first derive a moderate reward function from the pairwise ranking loss and then use the moderate reward to calculate a new sampling distribution to construct novel learning objectives and constraints. We also provide formal proof of the learnability of COPR. The experimental results show that COPR outperforms strong CL baselines on our proposed benchmark in terms of reward-based evaluation, GPT-4 evaluation, and human assessment.
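To illustrate the idea sketched in the abstract, the following is a minimal, hypothetical PyTorch sketch: a reward derived from the pairwise ranking (Bradley–Terry) loss induces a target sampling distribution over candidate responses, and a regularization term keeps the current policy close to the historical optimal policy. The function names, shapes, and the specific loss combination are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def moderate_reward(logp_chosen, logp_rejected, beta=1.0):
    # Hypothetical "moderate" reward derived from the pairwise ranking
    # (Bradley-Terry) loss: the scaled log-odds of preferring the chosen
    # response. Inputs are (batch,) sequence log-probabilities.
    return beta * (logp_chosen - logp_rejected)

def copr_style_loss(policy_logp, ref_logp, hist_logp, reward, lam=0.1):
    """Conceptual COPR-style objective (illustrative, not the paper's exact loss).

    policy_logp : (batch, k) log-probs of k candidate responses under the current policy
    ref_logp    : (batch, k) log-probs under a frozen reference model
    hist_logp   : (batch, k) log-probs under the historical optimal policy
    reward      : (batch, k) rewards assigned to the candidates
    """
    # Target sampling distribution induced by the reward on top of the
    # reference model (standard RLHF-style closed form).
    target = F.softmax(ref_logp + reward, dim=-1)

    # Fit the current policy to the reward-induced target distribution ...
    policy = F.log_softmax(policy_logp, dim=-1)
    fit_loss = F.kl_div(policy, target, reduction="batchmean")

    # ... while constraining it to stay close to the historical optimal
    # policy, which acts as the regularizer for continual learning.
    hist = F.softmax(hist_logp, dim=-1)
    reg_loss = F.kl_div(policy, hist, reduction="batchmean")

    return fit_loss + lam * reg_loss
```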
Anthology ID:
2025.findings-acl.281
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5377–5398
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.281/
Cite (ACL):
Han Zhang, Lin Gui, Yu Lei, Yuanzhao Zhai, Yehong Zhang, Zhuo Zhang, Yulan He, Hui Wang, Yue Yu, Kam-Fai Wong, Bin Liang, and Ruifeng Xu. 2025. COPR: Continual Human Preference Learning via Optimal Policy Regularization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5377–5398, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
COPR: Continual Human Preference Learning via Optimal Policy Regularization (Zhang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.281.pdf