MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao, Xiao Wang, Yuling Liu


Abstract
While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor(MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model’s post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.
Anthology ID:
2026.acl-long.390
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8639–8659
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.390/
DOI:
Bibkey:
Cite (ACL):
Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao, Xiao Wang, and Yuling Liu. 2026. MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8639–8659, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning (Zeng et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.390.pdf
Checklist:
 2026.acl-long.390.checklist.pdf