Hao Wang

Other people with similar names: Hao Wang (Beijing Institute of Technology), Hao Wang (UESTC), Hao Wang (Nanjing), Hao Wang (University of Science and Technology of China), Hao Wang, Hao Wang (Stevens Institute of Technology), Hao Wang, Hao Wang (HKUST), Hao Wang, Hao Wang, Hao Wang (Zhejiang), Hao Wang (Monash), Hao Wang

Unverified author pages with similar names: Hao Wang

2026

pdf bib abs

SafetyMem: Adaptive Jailbreak Defense via Dual-Component Safety Memory
Hao Wang | Ziyi Ni | Huacan Wang | Pin Lyu | Lei Sha
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Current defenses for Large Language Models (LLMs) often suffer from a ”memory gap”: parameter-modifying methods are computationally rigid, while inference-time filters cannot retain or reuse defense knowledge across interactions. To address this, we propose SafetyMem, a novel framework that secures LLMs through a dual-component safety memory system. SafetyMem consists of Semantic Safety Memory (SSM), which consolidates diverse jailbreak attempts into a structured knowledge base of attack patterns, and Episodic Safety Memory (ESM), which maintains an evolving set of procedural rules refined from historical detection failures. Unlike static defenses, SafetyMem allows the model to ”remember” and adapt to emerging adversarial strategies without parameter retraining. To further enhance robustness, we introduce an adversarial memory expansion mechanism that proactively generates challenging variants to solidify these memories. Experiments on standard and stealthy jailbreak benchmarks show that SafetyMem substantially reduces attack success rates while preserving efficiency and interpretability, consistently outperforming state-of-the-art baselines across multiple LLMs.

pdf bib abs

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
Hao Wang | Yanting Wang | Hao Li | Rui Li | Lei Sha
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial “jailbreak” attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self- Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs a Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.

Co-authors

Huacan Wang 1

Yanting Wang 1

Venues

ACL1
Findings1

Fix author