Summary the Savior: Harmful Keyword and Query-based Summarization for LLM Jailbreak Defense

Shagoto Rahman, Ian Harris


Abstract
Large Language Models (LLMs) are widely used for their capabilities, but they face threats from jailbreak attacks, which exploit LLMs to generate inappropriate information and bypass their defense systems. Existing defenses are often specific to particular jailbreak attacks; as a result, a robust, attack-independent solution is needed to address both Natural Language Processing (NLP) ambiguities and attack variability. In this study, we introduce Summary The Savior, a novel jailbreak detection mechanism leveraging harmful keywords and query-based, security-aware summary classification. By analyzing the illegal and improper content of prompts within the summaries, the proposed method remains robust against attack diversity and NLP ambiguities. To this end, we generate two novel datasets, one for harmful keyword extraction and one for security-aware summaries, using GPT-4 and Llama-3.1 70B, respectively. Moreover, an “ambiguous harmful” class is introduced to address content and intent ambiguities. Evaluation results demonstrate that Summary The Savior outperforms the state-of-the-art defense mechanisms Perplexity Filtering, SmoothLLM, and Erase and Check, achieving the lowest attack success rates across the jailbreak attacks PAIR, GCG, JBC, and Random Search on Llama-2, Vicuna-13B, and GPT-4. Our code, models, and results are available at: https://github.com/shrestho10/SummaryTheSavior
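
The abstract reports results as attack success rates. As a minimal sketch of how such a rate is commonly computed, and not the authors' evaluation code, the snippet below counts an attack as successful only when the prompt passes the defense and the model's response is judged harmful. The `AttackOutcome` type, its field names, and the function name are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AttackOutcome:
    # Hypothetical record for one jailbreak attempt (not from the paper).
    blocked_by_defense: bool   # the defense flagged and stopped the prompt
    response_is_harmful: bool  # a judge rated the model's reply as harmful


def attack_success_rate(outcomes: Sequence[AttackOutcome]) -> float:
    """Fraction of attempts that got past the defense AND produced harm."""
    if not outcomes:
        raise ValueError("at least one attack outcome is required")
    successes = sum(
        1
        for o in outcomes
        if not o.blocked_by_defense and o.response_is_harmful
    )
    return successes / len(outcomes)


if __name__ == "__main__":
    # Made-up outcomes for illustration only, not results from the paper.
    demo = [
        AttackOutcome(blocked_by_defense=True, response_is_harmful=False),
        AttackOutcome(blocked_by_defense=False, response_is_harmful=True),
        AttackOutcome(blocked_by_defense=False, response_is_harmful=False),
        AttackOutcome(blocked_by_defense=True, response_is_harmful=True),
    ]
    print(attack_success_rate(demo))
```

Under this assumed definition, a lower value means a stronger defense; how the paper judges a response as harmful is not stated in the abstract.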
Anthology ID:
2025.trustnlp-main.17
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galstyan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
266–275
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.17/
Cite (ACL):
Shagoto Rahman and Ian Harris. 2025. Summary the Savior: Harmful Keyword and Query-based Summarization for LLM Jailbreak Defense. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 266–275, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Summary the Savior: Harmful Keyword and Query-based Summarization for LLM Jailbreak Defense (Rahman & Harris, TrustNLP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.17.pdf