HAUNTATTACK: When Attack Follows Reasoning as a Shadow

Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Heming Xia, Lei Sha, Zhifang Sui


Abstract
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of over 70%, achieving up to 13 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
Anthology ID:
2026.findings-acl.1002
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
20072–20091
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1002/
Cite (ACL):
Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Heming Xia, Lei Sha, and Zhifang Sui. 2026. HAUNTATTACK: When Attack Follows Reasoning as a Shadow. In Findings of the Association for Computational Linguistics: ACL 2026, pages 20072–20091, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
HAUNTATTACK: When Attack Follows Reasoning as a Shadow (Ma et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1002.pdf
Checklist:
 2026.findings-acl.1002.checklist.pdf