Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding

Seongho Joo, Hyukhun Koh, Kyomin Jung


Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their potential misuse for harmful purposes remains a significant concern. To strengthen defenses against such vulnerabilities, it is essential to investigate universal jailbreak attacks that exploit intrinsic weaknesses in the architecture and learning paradigms of LLMs. In response, we propose Harmful Prompt Laundering (HaPLa), a novel and broadly applicable jailbreaking technique that requires only black-box access to target models. HaPLa incorporates two primary strategies: 1) abductive framing, which instructs LLMs to infer plausible intermediate steps toward harmful activities rather than respond directly to explicit harmful queries; and 2) symbolic encoding, a lightweight and flexible approach for obfuscating harmful content, exploiting the fact that current LLMs remain sensitive primarily to explicit harmful keywords. Experimental results show that HaPLa achieves an attack success rate of over 95% on GPT-series models and over 70% across all targets. Further analysis with diverse symbolic encoding rules reveals a fundamental challenge: it remains difficult to safety-tune LLMs without significantly diminishing their helpfulness on benign queries.
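To make the symbolic-encoding idea concrete, the sketch below applies a simple character-substitution cipher to a query string. This is a minimal, hypothetical illustration: the `SUBSTITUTIONS` mapping and the `symbolic_encode`/`symbolic_decode` helpers are assumptions for exposition only, not the encoding rules used by HaPLa; it merely shows why keyword-level filters that match exact surface forms can miss trivially rewritten text.

```python
# Minimal sketch of symbolic encoding as a character-substitution cipher.
# The mapping below is a hypothetical example for exposition; it is NOT the
# encoding scheme used in the paper, only an illustration of the general
# point that exact-keyword filters do not match rewritten surface forms.

SUBSTITUTIONS = {
    "a": "@", "e": "3", "i": "1", "o": "0", "s": "$", "t": "7",
}

def symbolic_encode(text: str) -> str:
    """Replace selected characters with visually similar symbols."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text.lower())

def symbolic_decode(text: str) -> str:
    """Invert the substitution (the mapping is one-to-one by construction)."""
    inverse = {v: k for k, v in SUBSTITUTIONS.items()}
    return "".join(inverse.get(ch, ch) for ch in text)

if __name__ == "__main__":
    query = "example sensitive keyword"
    encoded = symbolic_encode(query)
    print(encoded)  # -> 3x@mpl3 $3n$171v3 k3yw0rd
    assert symbolic_decode(encoded) == query.lower()
```

The encoded string is still trivially readable to a human (and decodable by an LLM), which is what makes this family of obfuscations relevant to the safety-tuning trade-off the abstract describes.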
Anthology ID:
2025.emnlp-main.1296
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
25500–25535
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1296/
Cite (ACL):
Seongho Joo, Hyukhun Koh, and Kyomin Jung. 2025. Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25500–25535, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding (Joo et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1296.pdf
Checklist:
2025.emnlp-main.1296.checklist.pdf