Course-Correction: Safety Alignment Using Synthetic Preferences

Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Liu Yan, Tianwei Zhang, Wei Xu, Han Qiu


Abstract
The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper systematically evaluates and enhances LLMs’ capability to perform course-correction, i.e., the model can steer away from generating harmful content autonomously. First, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning. Experiments on Llama2-Chat 7B and Qwen2 7B show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs’ safety, particularly in resisting jailbreak attacks.
Anthology ID:
2024.emnlp-industry.119
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1622–1649
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.emnlp-industry.119/
DOI:
10.18653/v1/2024.emnlp-industry.119
Cite (ACL):
Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Liu Yan, Tianwei Zhang, Wei Xu, and Han Qiu. 2024. Course-Correction: Safety Alignment Using Synthetic Preferences. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1622–1649, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
Course-Correction: Safety Alignment Using Synthetic Preferences (Xu et al., EMNLP 2024)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.emnlp-industry.119.pdf
Poster:
 2024.emnlp-industry.119.poster.pdf
Presentation:
 2024.emnlp-industry.119.presentation.pdf
Video:
 2024.emnlp-industry.119.video.mp4