Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It

Rajarshi Ghoshal; Debadri Basak; Salma Emad Mahmoud Abdelhalim; Pratibha Kaur Arora

Abstract

Chain-of-Thought (CoT) prompting is the dominant strategy for eliciting step-by-step reasoning in large language models, but its effect on code generation is poorly understood. We present a controlled 2×2 study of Qwen2.5-Coder-1.5B and DeepSeek-Coder-1.3B (each in base and instruction-tuned variants) on HumanEval, MBPP, and LiveCodeBench, plus scale-validation runs on Qwen2.5-Coder at 7B and 14B and a preliminary evaluation of CodeLlama-7B. We find that instruction tuning reverses CoT’s effect on small Qwen models: CoT improves the 1.5B base (+13.4pp, p<0.001) but significantly degrades the 1.5B instruct variant (-15.2pp, p<0.001). The reversal is sharply scale-bounded — it disappears at 7B (-0.6pp) and goes slightly positive at 14B (+2.4pp) — while CoT’s positive effect on base models grows monotonically with scale (+13.4 → +28.7 pp). DeepSeek-Coder-1.3B is insensitive regardless of regime. A direct token-count and truncation analysis shows the mechanism: at 1.5B, CoT inflates Qwen Instruct’s mean output length by 112 tokens and pushes 7.6× more generations into truncation, where Pass@1 is 0%; at 14B, the same prefix produces complete code well within budget. Layer-wise probing shows all four small models encode prompt type by Layer 1–4 (>90% accuracy) — universally, whether CoT helps or hurts — demonstrating that representation does not determine interpretation: the same internal signal drives divergent downstream behavior depending on training regime and capacity. Building on these mechanistic findings, we develop a probe-guided style router that, when trained per model on a labeled training split, selects among 12 prompt styles via a single 84 ms forward pass; it is statistically indistinguishable from the best fixed style in 7/8 settings and significantly outperforms CoT where CoT is most harmful (p=0.012, h=+0.40). 
Our results argue against applying CoT blindly to small instruct code models: its effect depends on architecture, training regime, and scale in ways that are mechanistically detectable from early-layer activations.
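
The abstract reports the router's advantage over CoT as an effect size, h=+0.40. The abstract does not define h; a minimal sketch follows, assuming it denotes Cohen's h, the standard effect size for a difference between two proportions such as Pass@1 rates. The function name and the pass rates in the example are hypothetical illustrations, not figures from the paper.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions (assumed reading of the abstract's "h").

    Each proportion is mapped through the arcsine transform
    phi(p) = 2 * asin(sqrt(p)); h is the difference phi(p1) - phi(p2).
    A positive value means p1 is the larger proportion.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"proportion out of range [0, 1]: {p}")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


if __name__ == "__main__":
    # Hypothetical Pass@1 rates, chosen only to illustrate the formula.
    routed_pass_rate = 0.60
    cot_pass_rate = 0.40
    print(f"h = {cohens_h(routed_pass_rate, cot_pass_rate):+.2f}")
```

Under Cohen's conventional thresholds, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, so an h of about 0.40 would sit between small and medium if this reading of the abstract is correct.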

Anthology ID: 2026.acl-srw.13
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 152–162
URL: https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.13/
Cite (ACL): Rajarshi Ghoshal, Debadri Basak, Salma Emad Mahmoud Abdelhalim, and Pratibha Kaur Arora. 2026. Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 152–162, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It (Ghoshal et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.13.pdf
