Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement
Haiming Qin, Jianxun Lian, Qimin Zhong, Mingyang Zhou, Hao Liao, Naipeng Chao
Abstract
Large Language Models (LLMs) are increasingly deployed in role-play scenarios, but their safety implications remain under-characterized. We present an explanatory framework grounded in Bandura’s Moral Disengagement theory and introduce a diagnostic benchmark (MD-Trace) for role-play jailbreaks. In our experiments, role-play improves safety behavior for benign personas while increasing unsafe compliance for malicious ones. We observe a Knowing-but-Doing failure in which models recognize safety risks in their thinking traces yet proceed to comply with harmful requests. Mechanism analysis suggests that Moral Justification is dominant, with Disregard of Consequences appearing as a secondary pattern. We compare multiple attack and defense methods and find that the diagnosis aligns with observed failure modes. Finally, we propose MD-Shield, an introspection-based defense that reduces attack success while maintaining Role Fidelity. The source code is publicly available at https://github.com/lavapapa/MoralJustify/.
- Anthology ID:
- 2026.findings-acl.349
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7035–7051
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.349/
- Cite (ACL):
- Haiming Qin, Jianxun Lian, Qimin Zhong, Mingyang Zhou, Hao Liao, and Naipeng Chao. 2026. Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7035–7051, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement (Qin et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.349.pdf