Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling

Deng Qiyuan, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang


Abstract
Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of outputs generated by the policy changes, their overall distribution remains relatively stable. This stability allows the sampling process from the target policy to be converted into a computationally efficient re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model’s intrinsic safety judgment capability to extract reward signals, which are then used to compute label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, markedly enhancing safety performance while avoiding roughly a 300x computational overhead.
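The sketch below illustrates the re-ranking idea described in the abstract, not the authors' actual implementation: score each (chosen, rejected) pair with a representation-based reward signal, convert the reward gap into a label confidence, and flip pairs whose labels the model's own safety judgment contradicts. All names (PreferencePair, reward_from_representation, encode, safety_direction, tau) are hypothetical placeholders, assumed here for illustration.

```python
# Minimal sketch of preference re-ranking via a representation-based reward.
# Assumptions: `encode(prompt, response)` returns a pooled hidden-state vector
# from the policy model, and `safety_direction` is a direction in that
# representation space associated with safe behavior.

from dataclasses import dataclass
import torch


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def reward_from_representation(hidden: torch.Tensor,
                               safety_direction: torch.Tensor) -> float:
    """Hypothetical reward: projection of a response's pooled hidden state
    onto a safety direction extracted from the model's own representations."""
    return torch.dot(hidden, safety_direction).item()


def rerank_pairs(pairs, encode, safety_direction, tau: float = 1.0):
    """Swap chosen/rejected when the label confidence favors the rejected
    response, so the preference data better matches the current policy."""
    reranked = []
    for p in pairs:
        r_chosen = reward_from_representation(encode(p.prompt, p.chosen),
                                              safety_direction)
        r_rejected = reward_from_representation(encode(p.prompt, p.rejected),
                                                safety_direction)
        # Label confidence: probability that the existing labeling is correct,
        # taken as a temperature-scaled sigmoid of the reward gap.
        confidence = torch.sigmoid(
            torch.tensor((r_chosen - r_rejected) / tau)).item()
        if confidence < 0.5:
            # The model's intrinsic judgment contradicts the label: flip it.
            p = PreferencePair(p.prompt, chosen=p.rejected, rejected=p.chosen)
        reranked.append((p, confidence))
    return reranked
```

Because re-ranking only re-scores existing pairs instead of sampling new responses from the target policy, its cost is a forward pass per response rather than full generation, which is where the claimed efficiency gain would come from.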
Anthology ID:
2025.acl-long.1504
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
31156–31171
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1504/
Cite (ACL):
Deng Qiyuan, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, and Min Zhang. 2025. Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31156–31171, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling (Qiyuan et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1504.pdf