Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang, Yiming Luo, Aimin Zhou, Fei Tan


Abstract
Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique–revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
Anthology ID:
2026.findings-acl.949
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
19017–19039
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.949/
DOI:
Bibkey:
Cite (ACL):
Yuan Fang, Yiming Luo, Aimin Zhou, and Fei Tan. 2026. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19017–19039, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF (Fang et al., Findings 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.949.pdf
Checklist:
 2026.findings-acl.949.checklist.pdf