@inproceedings{lu-etal-2025-x,
    title = "{X}-Boundary: Establishing Exact Safety Boundary to Shield {LLM}s from Jailbreak Attacks without Compromising Usability",
    author = "Lu, Xiaoya and
      Liu, Dongrui and
      Yu, Yi and
      Xu, Luxin and
      Shao, Jing",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Ros{\'e}, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.282/",
    doi = "10.18653/v1/2025.findings-emnlp.282",
    pages = "5247--5272",
    isbn = "979-8-89176-335-7",
    abstract = "With the widespread application of large language models (LLMs) across various domains, techniques for enhancing their security have progressed rapidly. In this paper, we reveal that although existing defense methods can improve the robustness of LLMs against jailbreaks, they compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of LLM mechanism interpretability, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against both single-turn and multi-turn jailbreak attacks, while reducing the over-refusal rate by about 20{\%} and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training."
}
Markdown (Informal)
[X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Jailbreak Attacks without Compromising Usability](https://aclanthology.org/2025.findings-emnlp.282/) (Lu et al., Findings of EMNLP 2025)
ACL