Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement

Haiming Qin; Jianxun Lian; Qimin Zhong; Mingyang Zhou; Hao Liao; Naipeng Chao

Abstract

Large Language Models (LLMs) are increasingly deployed in role-play scenarios, but their safety implications remain under-characterized. We present an explanatory framework grounded in Bandura’s Moral Disengagement theory and introduce a diagnostic benchmark (MD-Trace) for role-play jailbreaks. In our experiments, role-play improves safety behavior for benign personas while increasing unsafe compliance for malicious ones. We observe a Knowing-but-Doing failure in which models recognize safety risks in their thinking traces yet proceed to comply with harmful requests. Mechanism analysis suggests that Moral Justification is dominant, with Disregard of Consequences appearing as a secondary pattern. We compare multiple attack and defense methods and find that the diagnosis aligns with observed failure modes. Finally, we propose MD-Shield, an introspection-based defense that reduces attack success while maintaining Role Fidelity. The source code is publicly available at https://github.com/lavapapa/MoralJustify/.

Anthology ID: 2026.findings-acl.349
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7035–7051
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.349/
Cite (ACL): Haiming Qin, Jianxun Lian, Qimin Zhong, Mingyang Zhou, Hao Liao, and Naipeng Chao. 2026. Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7035–7051, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement (Qin et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.349.pdf
Checklist: 2026.findings-acl.349.checklist.pdf
