Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

Shiyu Xiang, Ansen Zhang, Yanfei Cao, Fan Yang, Ronghao Chen


Abstract
Although Aligned Large Language Models (LLMs) are trained to reject harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying “attack essences” remain the same. To address this issue, we introduce EDDF, an Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the “attack essence” from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.
Anthology ID:
2025.findings-acl.760
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
14727–14742
Language:
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.760/
DOI:
10.18653/v1/2025.findings-acl.760
Bibkey:
Cite (ACL):
Shiyu Xiang, Ansen Zhang, Yanfei Cao, Fan Yang, and Ronghao Chen. 2025. Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 14727–14742, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs (Xiang et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.760.pdf