Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

Jiabao Ji; Bairu Hou; Alexander Robey; George J. Pappas; Hamed Hassani; Yang Zhang; Eric Wong; Shiyu Chang

Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang

Abstract

Aligned large language models (LLMs) are vulnerable to jailbreaks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based attacks, there are no defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SemanticSmooth, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SemanticSmooth achieves strong robustness against both manually constructed jailbreak prompts and automatic jailbreak attacks like GCG, PAIR, and PromptRS while maintaining strong nominal performance on standard LLM evaluation benchmarks such as AlpacaEval for the instruction-following tasks and PiQA for the question-answering tasks.

Anthology ID:: 2025.ijcnlp-short.2
Volume:: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:: December
Year:: 2025
Address:: Mumbai, India
Editors:: Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:: IJCNLP | AACL
SIG:
Publisher:: The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Note:
Pages:: 7–40
Language:
URL:: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-short.2/
DOI:
Bibkey:
Cite (ACL):: Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. 2025. Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 7–40, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):: Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing (Ji et al., IJCNLP-AACL 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-short.2.pdf

PDF Cite Search Fix data