Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It

Rajarshi Ghoshal; Debadri Basak; Salma Emad Mahmoud Abdelhalim; Pratibha Kaur Arora

Abstract

Chain-of-Thought (CoT) prompting is the dominant strategy for eliciting step-by-step reasoning in large language models, but its effect on code generation is poorly understood. We present a controlled 2×2 study of Qwen2.5-Coder-1.5B and DeepSeek-Coder-1.3B (each in base and instruction-tuned variants) on HumanEval, MBPP, and LiveCodeBench, plus scale-validation runs on Qwen2.5-Coder at 7B and 14B and a preliminary evaluation of CodeLlama-7B. We find that instruction tuning reverses CoT’s effect on small Qwen models: CoT improves the 1.5B base (+13.4pp, p<0.001) but significantly degrades the 1.5B instruct variant (-15.2pp, p<0.001). The reversal is sharply scale-bounded — it disappears at 7B (-0.6pp) and goes slightly positive at 14B (+2.4pp) — while CoT’s positive effect on base models grows monotonically with scale (+13.4 → +28.7 pp). DeepSeek-Coder-1.3B is insensitive regardless of regime. A direct token-count and truncation analysis shows the mechanism: at 1.5B, CoT inflates Qwen Instruct’s mean output length by 112 tokens and pushes 7.6× more generations into truncation, where Pass@1 is 0%; at 14B, the same prefix produces complete code well within budget. Layer-wise probing shows all four small models encode prompt type by Layer 1–4 (>90% accuracy) — universally, whether CoT helps or hurts — demonstrating that representation does not determine interpretation: the same internal signal drives divergent downstream behavior depending on training regime and capacity. Building on these mechanistic findings, we develop a probe-guided style router that, when trained per model on a labeled training split, selects among 12 prompt styles via a single 84 ms forward pass; it is statistically indistinguishable from the best fixed style in 7/8 settings and significantly outperforms CoT where CoT is most harmful (p=0.012, h=+0.40). 
Our results argue against applying CoT blindly to small instruct code models: its effect depends on architecture, training regime, and scale in ways that are mechanistically detectable from early-layer activations.
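
The abstract reports effects in percentage points (pp) and an effect size h. As a minimal sketch, assuming h denotes Cohen's h for the difference between two proportions (the abstract does not define it), both quantities can be computed from a pair of Pass@1 rates. The function names and the pass rates below are hypothetical illustrations, not values from the paper.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions in [0, 1].

    Assumption: this is the effect size the abstract calls "h".
    h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def pp_delta(p1: float, p2: float) -> float:
    """Difference between two proportions, in percentage points."""
    return 100.0 * (p1 - p2)


if __name__ == "__main__":
    # Hypothetical Pass@1 rates, chosen only for illustration.
    routed_pass_rate = 0.60
    cot_pass_rate = 0.40
    print(f"delta = {pp_delta(routed_pass_rate, cot_pass_rate):+.1f} pp")
    print(f"h     = {cohens_h(routed_pass_rate, cot_pass_rate):+.2f}")
```

Under this reading, h is antisymmetric in its arguments, so a positive value means the first condition has the higher pass rate. Because of the arcsine transform, the same pp gap yields a larger h near 0% or 100% than near 50%.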

Anthology ID: 2026.acl-srw.13
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 152–162
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.13/
Cite (ACL): Rajarshi Ghoshal, Debadri Basak, Salma Emad Mahmoud Abdelhalim, and Pratibha Kaur Arora. 2026. Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 152–162, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It (Ghoshal et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.13.pdf
