SafetyMem: Adaptive Jailbreak Defense via Dual-Component Safety Memory

Hao Wang, Ziyi Ni, Huacan Wang, Pin Lyu, Lei Sha


Abstract
Current defenses for Large Language Models (LLMs) often suffer from a "memory gap": parameter-modifying methods are computationally rigid, while inference-time filters cannot retain or reuse defense knowledge across interactions. To address this, we propose SafetyMem, a novel framework that secures LLMs through a dual-component safety memory system. SafetyMem consists of Semantic Safety Memory (SSM), which consolidates diverse jailbreak attempts into a structured knowledge base of attack patterns, and Episodic Safety Memory (ESM), which maintains an evolving set of procedural rules refined from historical detection failures. Unlike static defenses, SafetyMem allows the model to "remember" and adapt to emerging adversarial strategies without parameter retraining. To further enhance robustness, we introduce an adversarial memory expansion mechanism that proactively generates challenging variants to solidify these memories. Experiments on standard and stealthy jailbreak benchmarks show that SafetyMem substantially reduces attack success rates while preserving efficiency and interpretability, consistently outperforming state-of-the-art baselines across multiple LLMs.
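To make the dual-component idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify data structures, matching functions, or thresholds, so everything here is an assumption: the class and method names, the token-overlap score standing in for a real semantic encoder, the substring-based rules, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a semantic encoder."""
    return set(text.lower().split())


@dataclass
class SemanticSafetyMemory:
    """Toy knowledge base of attack patterns (hypothetical structure)."""
    patterns: Dict[str, Set[str]] = field(default_factory=dict)

    def consolidate(self, name: str, example: str) -> None:
        # Merge the vocabulary of a new jailbreak example into a named pattern.
        self.patterns.setdefault(name, set()).update(tokens(example))

    def best_match(self, prompt: str) -> Tuple[Optional[str], float]:
        # Score = fraction of prompt tokens covered by a stored pattern.
        words = tokens(prompt)
        best, best_score = None, 0.0
        for name, vocab in self.patterns.items():
            score = len(words & vocab) / len(words) if words else 0.0
            if score > best_score:
                best, best_score = name, score
        return best, best_score


@dataclass
class EpisodicSafetyMemory:
    """Toy rule store refined from detection failures (hypothetical)."""
    rules: List[str] = field(default_factory=list)

    def record_failure(self, trigger_phrase: str) -> None:
        # Assumption: a "rule" is a phrase extracted from a missed attack.
        phrase = trigger_phrase.lower()
        if phrase not in self.rules:
            self.rules.append(phrase)

    def triggered(self, prompt: str) -> bool:
        return any(rule in prompt.lower() for rule in self.rules)


def is_blocked(prompt: str, ssm: SemanticSafetyMemory,
               esm: EpisodicSafetyMemory, threshold: float = 0.5) -> bool:
    """Block if either memory component flags the prompt."""
    _, score = ssm.best_match(prompt)
    return score >= threshold or esm.triggered(prompt)
```

In this sketch, a prompt that the pattern memory misses can later be caught by recording a rule in the episodic memory, which mirrors the adaptation without parameter retraining described in the abstract. No model weights are involved at any point.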
Anthology ID:
2026.acl-long.1168
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
25486–25509
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1168/
Cite (ACL):
Hao Wang, Ziyi Ni, Huacan Wang, Pin Lyu, and Lei Sha. 2026. SafetyMem: Adaptive Jailbreak Defense via Dual-Component Safety Memory. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25486–25509, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SafetyMem: Adaptive Jailbreak Defense via Dual-Component Safety Memory (Wang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1168.pdf
Checklist:
 2026.acl-long.1168.checklist.pdf