Jailbreaking Large Language Models with Morality Attacks

Ying Su; Zheng Mingen; Weili Diao; Haoran Li

Abstract

Pluralism alignment pursues the sophisticated and necessary goal of creating AI that can coexist with and serve a morally multifaceted humanity. Much of the research on pluralism alignment focuses on improving how large language models (LLMs) learn pluralistic values. Although this is essential, the robustness of LLMs in producing moral content across pluralistic values remains underexplored. Inspired by the striking persuasive power of jailbreak prompts, we propose leveraging jailbreak attacks to study LLMs’ internal pluralistic values. Specifically, we develop a morality dataset of 10.4K instances in two categories: Value Ambiguity and Value Conflict. Using this dataset, we formalize four adversarial attacks that manipulate LLMs’ judgments on morality questions. We evaluate both LLMs and guardrail models, which are typically used in generative systems that accept flexible user input. Our experimental results reveal a critical vulnerability of LLMs and guardrail models to these subtle and sophisticated morality-aware attacks.

Anthology ID: 2026.findings-acl.1461
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 29228–29254
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1461/
Cite (ACL): Ying Su, Zheng Mingen, Weili Diao, and Haoran Li. 2026. Jailbreaking Large Language Models with Morality Attacks. In Findings of the Association for Computational Linguistics: ACL 2026, pages 29228–29254, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Jailbreaking Large Language Models with Morality Attacks (Su et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1461.pdf
Checklist: 2026.findings-acl.1461.checklist.pdf
