Zhaohan Xi
2025
Data to Defense: The Role of Curation in Aligning Large Language Models Against Safety Compromise
Xiaoqun Liu | Jiacheng Liang | Luoxi Tang | Muchao Ye | Weicheng Ma | Zhaohan Xi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are widely adapted for downstream applications through fine-tuning, a process known as customization. However, recent studies have identified a vulnerability during this process, where malicious samples can compromise the robustness of LLMs and amplify harmful behaviors. To address this challenge, we propose an adaptive data curation approach that allows any text to be curated into data that counteracts harmful samples during customization. To avoid the need for additional defensive modules, we further introduce a comprehensive mitigation framework spanning the lifecycle of the customization process: before customization to immunize LLMs against future compromise attempts, during customization to neutralize risks, and after customization to restore compromised models. Experimental results demonstrate a significant reduction in compromising effects, achieving up to a 100% success rate in generating safe responses. By combining adaptive data curation with lifecycle-based mitigation strategies, this work represents a solid step forward in mitigating compromising risks and ensuring the secure adaptation of LLMs.
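A minimal sketch of the general idea of mixing curated, safety-enhancing samples into a customization (fine-tuning) set so they can counteract harmful samples. The helper names (`curate`, `is_flagged_harmful`) and the mixing ratio are illustrative assumptions, not the paper's actual method.

```python
import random

def build_customization_set(user_samples, candidate_texts, curate,
                            is_flagged_harmful, mix_ratio=0.2, seed=0):
    """Blend user-provided fine-tuning data with curated defensive samples.

    user_samples:    samples supplied for customization (possibly containing harmful ones)
    candidate_texts: arbitrary texts to be curated into safety-oriented samples
    curate:          placeholder callable that turns a text into a defensive training sample
    is_flagged_harmful: placeholder callable that screens obviously harmful samples
    """
    random.seed(seed)
    # Drop samples that are obviously harmful; curated data is meant to handle residual risk.
    kept = [s for s in user_samples if not is_flagged_harmful(s)]
    # Curate arbitrary texts into safety-enhancing samples.
    curated = [curate(t) for t in candidate_texts]
    # Mix a fraction of curated samples into the customization set.
    n_defensive = min(int(mix_ratio * len(kept)), len(curated))
    return kept + random.sample(curated, n_defensive)
```

The same curated data could in principle be applied before customization (as immunization data), during it (mixed in as above), or after it (to restore a compromised model), matching the lifecycle framing in the abstract.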
2024
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
Tianrong Zhang | Zhaohan Xi | Ting Wang | Prasenjit Mitra | Jinghui Chen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performance. Meanwhile, the soaring cost of training PLMs and their remarkable generalizability have jointly made few-shot fine-tuning and prompting the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens, which approximate the trigger and counteract it, respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and its performance under domain shift further shows PromptFix’s applicability to models pretrained on unknown data sources, which is the common case in prompt-tuning scenarios.
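A minimal, hypothetical sketch of the two-soft-token-set idea described above: one set of soft tokens is optimized adversarially to approximate a backdoor trigger, while a second set is optimized to counteract it, with the model parameters frozen throughout. The HuggingFace-style interface (`inputs_embeds`, `.logits`) and all names and hyperparameters here are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def prompt_fix_sketch(model, embed_dim, loader, n_tok=5, lr=1e-3):
    # Freeze the backdoored model; only the soft tokens are trained.
    for p in model.parameters():
        p.requires_grad_(False)

    trigger = torch.randn(n_tok, embed_dim, requires_grad=True)  # approximates the unknown trigger
    defense = torch.randn(n_tok, embed_dim, requires_grad=True)  # soft prompt that counteracts it
    opt_t = torch.optim.Adam([trigger], lr=lr)
    opt_d = torch.optim.Adam([defense], lr=lr)

    for embeds, labels in loader:  # embeds: (batch, seq, embed_dim); labels: (batch,)
        bsz = embeds.size(0)

        # Adversarial step: push the trigger tokens toward maximizing loss on clean labels,
        # i.e., toward trigger-like behavior.
        x = torch.cat([trigger.unsqueeze(0).expand(bsz, -1, -1), embeds], dim=1)
        loss_t = -F.cross_entropy(model(inputs_embeds=x).logits[:, -1, :], labels)
        opt_t.zero_grad(); loss_t.backward(); opt_t.step()

        # Defensive step: with the approximated trigger present, tune the defensive tokens
        # so the model still predicts the correct labels.
        x = torch.cat([defense.unsqueeze(0).expand(bsz, -1, -1),
                       trigger.detach().unsqueeze(0).expand(bsz, -1, -1),
                       embeds], dim=1)
        loss_d = F.cross_entropy(model(inputs_embeds=x).logits[:, -1, :], labels)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    return defense.detach()
```

Because the defensive tokens are trained against an adversarially updated trigger rather than a fixed inverted one, this min-max loop avoids enumerating candidate backdoor configurations, matching the adaptive balance the abstract describes.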