Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, Xiuying Chen


Abstract
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs into generating harmful text. However, understanding of how jailbreaking works remains limited, hindering the development of effective defense strategies. To address this issue, we conduct a large-scale analysis of seven different jailbreak methods and identify that disagreements among methods stem from insufficient observation samples. We introduce the concept of a safety boundary and discover that jailbreaks shift harmful activations outside this boundary, where LLMs become less sensitive to harmful information. Our analysis reveals that low and middle layers play a critical role in these shifts, while deeper layers have a lesser impact. Building on these insights, we propose a novel defense mechanism called Activation Boundary Defense (ABD), which adaptively constrains activations within the safety boundary. To enhance its effectiveness, we use Bayesian optimization to selectively apply the defense to the low and middle layers. Experiments on several benchmark datasets demonstrate that ABD achieves an average Defense Success Rate (DSR) of over 98% against various jailbreak attacks, with less than a 2% impact on the model’s general capabilities.
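To make the core idea of the abstract concrete, the following is a minimal sketch (not the authors' implementation) of how an activation-boundary constraint could be wired into a transformer: a forward hook on selected low/middle decoder layers projects hidden states that fall outside a precomputed per-layer boundary back onto it. The model name, the ball-shaped boundary, the per-layer bounds, and the hard-coded layer range are all illustrative assumptions; the paper estimates the safety boundary from activations and chooses the layers to defend via Bayesian optimization.

```python
# Hypothetical sketch of an activation-boundary defense via PyTorch forward hooks.
# All bounds and layer choices below are placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Hypothetical per-layer boundaries (center, radius), which in practice would be
# estimated offline from activations on benign prompts. Here: low/middle layers only.
boundaries = {
    layer_idx: (torch.zeros(model.config.hidden_size), 25.0)
    for layer_idx in range(model.config.num_hidden_layers // 2)
}

def make_hook(center, radius):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        c = center.to(hidden.device, hidden.dtype)
        # Project any activation outside the ball of the given radius back onto its surface.
        diff = hidden - c
        norm = diff.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        scale = torch.clamp(radius / norm, max=1.0)
        constrained = c + diff * scale
        if isinstance(output, tuple):
            return (constrained,) + output[1:]
        return constrained
    return hook

# Attach the constraint only to the selected layers.
for idx, (center, radius) in boundaries.items():
    model.model.layers[idx].register_forward_hook(make_hook(center, radius))
```

In this sketch, layers outside the selected range are left untouched, mirroring the paper's finding that deeper layers contribute less to the harmful activation shift.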
Anthology ID:
2025.acl-long.1233
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
25378–25398
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1233/
Cite (ACL):
Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, and Xiuying Chen. 2025. Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25378–25398, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models (Gao et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1233.pdf