@inproceedings{kaneko-etal-2025-online,
    title = "Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization",
    author = "Kaneko, Masahiro and
      Talat, Zeerak and
      Baldwin, Timothy",
    editor = "Inui, Kentaro and
      Sakti, Sakriani and
      Wang, Haofen and
      Wong, Derek F. and
      Bhattacharyya, Pushpak and
      Banerjee, Biplab and
      Ekbal, Asif and
      Chakraborty, Tanmoy and
      Singh, Dhirendra Pratap",
    booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.140/",
    pages = "2592--2609",
    isbn = "979-8-89176-298-5",
    abstract = "Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs{---}using the model{'}s previous responses to guide each new iteration{---}have been found to be a highly effective attack strategy. Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past{-}Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks."
}

@comment{Informal citation (scraped alongside the entry; kept for reference, ignored by BibTeX):
Markdown (Informal)
[Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization](https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.140/) (Kaneko et al., IJCNLP-AACL 2025)
ACL
- Masahiro Kaneko, Zeerak Talat, and Timothy Baldwin. 2025. Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 2592-2609, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
}