Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Hao Wang; Yanting Wang; Hao Li; Rui Li; Lei Sha

Abstract

Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial “jailbreak” attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs an Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
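The abstract does not give the exact scoring rule, so the following is only a minimal sketch of how UCB-style sampling over an experience pool might look, not the authors' implementation. The `Experience` class, the `[0, 1]` reward scale, the exploitation term `1 - reward`, and the exploration constant `c` are all assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Experience:
    prompt: str        # stored attack prompt (hypothetical field)
    reward: float      # Defender reward, assumed to lie in [0, 1]; low = defense failed
    visits: int = 0    # how many times this entry has been replayed


def ucb_score(exp: Experience, total_visits: int, c: float = 1.0) -> float:
    """Standard UCB1-style score, adapted so that LOW rewards score high."""
    if exp.visits == 0:
        return math.inf  # never-replayed entries are tried first
    exploit = 1.0 - exp.reward  # assumption: prioritise failure cases
    explore = c * math.sqrt(math.log(total_visits) / exp.visits)
    return exploit + explore


def sample_replay(pool: list[Experience], c: float = 1.0) -> Experience:
    """Pick the entry with the highest UCB score and record the visit."""
    total = sum(e.visits for e in pool)
    best = max(pool, key=lambda e: ucb_score(e, total, c))
    best.visits += 1
    return best
```

With a pool holding one well-defended case (reward 0.9) and one failure case (reward 0.1), each entry is replayed once and the failure case is then preferred, while the exploration term keeps rarely visited entries from being starved indefinitely.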

Anthology ID: 2026.findings-acl.933
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 18700–18716
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.933/
Cite (ACL): Hao Wang, Yanting Wang, Hao Li, Rui Li, and Lei Sha. 2026. Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay. In Findings of the Association for Computational Linguistics: ACL 2026, pages 18700–18716, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay (Wang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.933.pdf
Checklist: 2026.findings-acl.933.checklist.pdf
