Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning

Adib Hasan, Ileana Rugina, Alex Wang


Abstract
This paper investigates how model compression affects the way Large Language Models (LLMs) process prompts, particularly with respect to jailbreak resistance. We show that moderate WANDA pruning can enhance resistance to jailbreaking attacks without fine-tuning, while maintaining performance on standard benchmarks. To systematically evaluate this safety enhancement, we introduce a dataset of 225 harmful tasks across five categories. Our analysis of LLaMA-2 Chat, Vicuna 1.3, and Mistral Instruct v0.2 reveals that the benefits of pruning correlate with a model's initial safety level. We interpret these results by examining changes in attention patterns and perplexity shifts, demonstrating that pruned models exhibit sharper attention and increased sensitivity to artificial jailbreak constructs. We extend our evaluation to the AdvBench harmful behavior tasks and the GCG attack method. We find that LLaMA-2 is much safer on AdvBench prompts than on our dataset when evaluated with manual jailbreak attempts, and that pruning is effective against both automated attacks and manual jailbreaking on AdvBench.
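For readers unfamiliar with WANDA, the criterion scores each weight by its magnitude times the L2 norm of the corresponding input activation, then zeroes the lowest-scoring weights within each output row. The sketch below is illustrative only: the `wanda_prune` helper, its signature, and the 50% sparsity setting are our assumptions, not the authors' implementation or the official Wanda codebase.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Illustrative Wanda-style pruning of one linear layer (assumed helper).

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations fed into the layer.
    sparsity: fraction of weights to zero within each output row.
    """
    # Wanda score: |weight| * L2 norm of the matching input feature.
    feat_norms = np.linalg.norm(X, axis=0)       # (in_features,)
    scores = np.abs(W) * feat_norms[None, :]     # (out_features, in_features)

    # Zero the k lowest-scoring weights in each output row.
    k = int(W.shape[1] * sparsity)
    pruned = W.copy()
    if k > 0:
        idx = np.argsort(scores, axis=1)[:, :k]  # indices of k smallest per row
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned

# Toy example: prune an 8x16 layer to 50% sparsity per row.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(32, 16))
W_sparse = wanda_prune(W, X, sparsity=0.5)
```

Unlike plain magnitude pruning, the activation-norm factor lets a small weight survive if it multiplies a consistently large input feature, which is why Wanda needs only a small calibration set and no retraining.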
Anthology ID:
2024.blackboxnlp-1.26
Volume:
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Venue:
BlackboxNLP
Publisher:
Association for Computational Linguistics
Pages:
417–430
URL:
https://aclanthology.org/2024.blackboxnlp-1.26
DOI:
10.18653/v1/2024.blackboxnlp-1.26
Cite (ACL):
Adib Hasan, Ileana Rugina, and Alex Wang. 2024. Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 417–430, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning (Hasan et al., BlackboxNLP 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.blackboxnlp-1.26.pdf