Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models

Yuansen Liu; Yixuan Tang; Anthony Kum Hoe Tung

Abstract

Current LLM safety research predominantly focuses on mitigating **Goal Hijacking**, that is, preventing attackers from redirecting a model’s high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in **Reasoning Alignment**. We expose the inherent fragility of current alignment techniques by proposing a new adversarial prompt attack paradigm: **Reasoning Hijacking**. To demonstrate this vulnerability, we instantiate Reasoning Hijacking as the **Criteria Attack**, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking keeps the task goal intact and instead manipulates the model’s decision-making logic through spurious reasoning shortcuts. Through extensive experiments on three tasks (toxic comment detection, negative review detection, and spam detection), we demonstrate that even state-of-the-art models are highly fragile, consistently prioritizing injected heuristic shortcuts over rigorous semantic analysis. Crucially, because the model’s explicit intent remains aligned with the user’s instructions, these attacks can bypass defenses designed to detect goal deviation (e.g., SecAlign, StruQ), revealing a fundamental blind spot in the current safety landscape. Data and code are available at [https://github.com/Yuan-Hou/criteria_attack](https://github.com/Yuan-Hou/criteria_attack).

Anthology ID: 2026.acl-long.1698
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 36646–36665
URL: https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.1698/
Cite (ACL): Yuansen Liu, Yixuan Tang, and Anthony Kum Hoe Tung. 2026. Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36646–36665, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models (Liu et al., ACL 2026)
PDF: https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.1698.pdf
Checklist: 2026.acl-long.1698.checklist.pdf
