Joanna Hao
2026
What Does Alignment Cost? The Structural Brittleness of Chain-of-Thought Reasoning
Joanna Hao | Shanduojiao Jiang | Sai Asish Nakka
Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026)
While Chain-of-Thought (CoT) prompting enables Large Language Models to explicitly justify their predictions, the extent to which these textual rationales faithfully reflect internal computation remains unclear. We investigate the circuit-level impact of alignment by performing a strict within-family comparison of the 1B-parameter Llama 3 architecture (Base vs. Instruct). Executing dynamic circuit discovery and dual-direction resample ablation on unconstrained CoT traces across synthetic mathematical primitives and a GSM8K proxy, we find that foundation models possess highly redundant, self-repairing computational networks; completely corrupting their primary reasoning circuits yields a minimal performance drop (2.92%) due to the dynamic compensation of backup heads (the Hydra Effect). In contrast, the instruction-tuned model exhibits reduced structural redundancy, suffering more than double the degradation (6.79%) under identical perturbation. We formalize our observation as an "Alignment Tax on Redundancy": optimizing for human-preference compliance repurposes dormant backup circuits, centralizing mathematical routing and rendering the aligned model’s reasoning pathways significantly more vulnerable to internal perturbation.
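As a rough illustration of the resample ablation mentioned above, the sketch below replaces the outputs of selected attention heads with the same heads' outputs taken from a different input, so the targeted heads carry information unrelated to the current example. This is a toy NumPy sketch, not the authors' pipeline: the function name `resample_ablate`, the array layout `(n_examples, n_heads, d_model)`, and the cyclic-shift donor scheme are all assumptions made for illustration, and the "dual-direction" variant is not covered here.

```python
import numpy as np


def resample_ablate(head_outputs, heads, rng):
    """Return a copy of head_outputs with the chosen heads resample-ablated.

    head_outputs: array of shape (n_examples, n_heads, d_model) (assumed layout).
    heads: iterable of head indices to corrupt (hypothetical "primary circuit").
    rng: a numpy.random.Generator used to pick the donor offset.
    """
    n_examples = head_outputs.shape[0]
    if n_examples < 2:
        raise ValueError("resampling needs at least two examples")

    ablated = head_outputs.copy()
    # Shift by a random non-zero offset so that no example is its own donor.
    offset = rng.integers(1, n_examples)
    donor = (np.arange(n_examples) + offset) % n_examples
    for h in heads:
        # Overwrite head h's output with its output on the donor example.
        ablated[:, h, :] = head_outputs[donor, h, :]
    return ablated


def relative_degradation(drop_a, drop_b):
    """Ratio of two performance drops (both in the same units, e.g. percent)."""
    return drop_b / drop_a


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(4, 3, 8))  # 4 examples, 3 heads, d_model = 8
    corrupted = resample_ablate(acts, heads=[0], rng=rng)
    # Heads outside the ablated set are left untouched.
    print(np.array_equal(corrupted[:, 1:], acts[:, 1:]))
    # Ratio of the drops reported in the abstract (Instruct vs. Base).
    print(round(relative_degradation(2.92, 6.79), 2))
```

In a real experiment the corrupted activations would be patched back into the forward pass and task accuracy re-measured. The second helper only restates the abstract's comparison: 6.79 / 2.92 is roughly 2.3, consistent with "more than double".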