2025
Language Models Resist Alignment: Evidence From Data Compression
Jiaming Ji | Kaile Wang | Tianyi Alex Qiu | Boyuan Chen | Jiayi Zhou | Changye Li | Hantao Lou | Josef Dai | Yunhuai Liu | Yaodong Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity correlates positively with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.
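A minimal sketch (not the authors' code) of the kind of probe the abstract describes: fine-tune an aligned model on a small amount of unrelated data and track how quickly its fit drifts back toward pre-training-style text relative to alignment-style text. The model name and the probe texts below are placeholders.

```python
# Illustrative probe of "elasticity": does further fine-tuning erode alignment
# behavior faster than it erodes the pre-training distribution?
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in an actual aligned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def perplexity(text: str) -> float:
    """Perplexity of the current model on a probe text (lower = closer fit)."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

alignment_probe = "I cannot help with that request because it could cause harm."
pretrain_probe = "The quick brown fox jumps over the lazy dog."
finetune_texts = ["Unrelated fine-tuning sentence number %d." % i for i in range(8)]

optimizer = AdamW(model.parameters(), lr=5e-5)
for step, text in enumerate(finetune_texts):
    enc = tok(text, return_tensors="pt")
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # If the model is "elastic", the alignment probe should become unlikely
    # faster than the pre-training probe as fine-tuning proceeds.
    print(step, perplexity(alignment_probe), perplexity(pretrain_probe))
```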
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji | Donghai Hong | Borong Zhang | Boyuan Chen | Josef Dai | Boren Zheng | Tianyi Alex Qiu | Jiayi Zhou | Kaile Wang | Boxun Li | Sirui Han | Yike Guo | Yaodong Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Building on this, we collect 166.8k preference data points, comprising dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off in a single judgment). Using this large-scale annotation data, we further train severity-sensitive moderation models for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
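A minimal sketch of how one dual-preference record described above could be represented. The field names are hypothetical (not the released schema); only the structure follows the abstract: one prompt, two answers, per-answer safety meta-labels (harm categories and severity), and separate helpfulness/harmlessness preferences.

```python
# Hypothetical record layout for the dual-preference split described above.
from dataclasses import dataclass, field

@dataclass
class SafetyMetaLabel:
    is_safe: bool
    harm_categories: list[str] = field(default_factory=list)  # subset of the 19 categories
    severity: str = "minor"  # three levels from minor to severe (middle level assumed)

@dataclass
class DualPreferenceRecord:
    prompt: str
    response_a: str
    response_b: str
    label_a: SafetyMetaLabel
    label_b: SafetyMetaLabel
    more_helpful: str   # "a" or "b", helpfulness judged on its own
    more_harmless: str  # "a" or "b", harmlessness judged on its own

example = DualPreferenceRecord(
    prompt="How do I dispose of old medication safely?",
    response_a="Take it to a pharmacy take-back program.",
    response_b="Flush everything down the sink.",
    label_a=SafetyMetaLabel(is_safe=True),
    label_b=SafetyMetaLabel(is_safe=False, harm_categories=["environmental harm"], severity="severe"),
    more_helpful="a",
    more_harmless="a",
)
print(example.more_helpful, example.label_b.severity)
```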
2024
对齐的理论、技术与评估(Theories, Techniques, and Evaluation of AI Alignment)
Jiaming Ji (吉嘉铭) | Tianyi Qiu (邱天异) | Boyuan Chen (陈博远) | Yaodong Yang (杨耀东)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum)
AI alignment aims to make the behavior of artificial intelligence systems consistent with human intentions and values. As AI systems grow more capable, the risks arising from alignment failures also keep increasing. Hundreds of AI experts and public figures have expressed concern about AI risks, arguing that mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war (CAIS, 2023). To provide a comprehensive and up-to-date overview of the alignment field, this paper examines the core theories, techniques, and evaluation of alignment. First, it identifies four key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, the paper outlines the current landscape of AI alignment research and decomposes it into two key components: forward alignment and backward alignment. The paper aims to provide a comprehensive and beginner-friendly survey of alignment research. It also releases and continuously updates the website www.alignmentsurvey.com, which offers tutorials, paper collections, and other resources. A more detailed discussion and analysis is available at https://arxiv.org/abs/2310.19852.
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
Zhaofeng Wu | Linlu Qiu | Alexis Ross | Ekin Akyürek | Boyuan Chen | Bailin Wang | Najoung Kim | Jacob Andreas | Yoon Kim
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on “counterfactual” task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects.
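One illustrative pairing in the spirit of the framework described above (not the authors' released evaluation code): the same task, two-digit addition, posed under the default assumption (base 10) and under a counterfactual assumption (base 9). A model relying on a general addition procedure should handle both; one reciting memorized base-10 patterns should degrade on the counterfactual variant.

```python
# Build a default/counterfactual pair of addition items over the same operands.
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base."""
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(str(r))
        if n == 0:
            break
    return "".join(reversed(digits))

def addition_item(a: int, b: int, base: int) -> tuple[str, str]:
    """Return (prompt, gold answer) for a + b, written in `base`."""
    prompt = (
        f"In base {base}, what is {to_base(a, base)} + {to_base(b, base)}? "
        "Answer in the same base."
    )
    return prompt, to_base(a + b, base)

default_item = addition_item(27, 35, base=10)        # default condition
counterfactual_item = addition_item(27, 35, base=9)  # same skill, shifted assumption
print(default_item)
print(counterfactual_item)
```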