Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It
Rajarshi Ghoshal, Debadri Basak, Salma Emad Mahmoud Abdelhalim, Pratibha Kaur Arora
Abstract
Chain-of-Thought (CoT) prompting is the dominant strategy for eliciting step-by-step reasoning in large language models, but its effect on code generation is poorly understood. We present a controlled 2×2 study of Qwen2.5-Coder-1.5B and DeepSeek-Coder-1.3B (each in base and instruction-tuned variants) on HumanEval, MBPP, and LiveCodeBench, plus scale-validation runs on Qwen2.5-Coder at 7B and 14B and a preliminary evaluation of CodeLlama-7B. We find that instruction tuning reverses CoT’s effect on small Qwen models: CoT improves the 1.5B base (+13.4pp, p<0.001) but significantly degrades the 1.5B instruct variant (-15.2pp, p<0.001). The reversal is sharply scale-bounded — it disappears at 7B (-0.6pp) and goes slightly positive at 14B (+2.4pp) — while CoT’s positive effect on base models grows monotonically with scale (+13.4 → +28.7 pp). DeepSeek-Coder-1.3B is insensitive regardless of regime. A direct token-count and truncation analysis shows the mechanism: at 1.5B, CoT inflates Qwen Instruct’s mean output length by 112 tokens and pushes 7.6× more generations into truncation, where Pass@1 is 0%; at 14B, the same prefix produces complete code well within budget. Layer-wise probing shows all four small models encode prompt type by Layer 1–4 (>90% accuracy) — universally, whether CoT helps or hurts — demonstrating that representation does not determine interpretation: the same internal signal drives divergent downstream behavior depending on training regime and capacity. Building on these mechanistic findings, we develop a probe-guided style router that, when trained per model on a labeled training split, selects among 12 prompt styles via a single 84 ms forward pass; it is statistically indistinguishable from the best fixed style in 7/8 settings and significantly outperforms CoT where CoT is most harmful (p=0.012, h=+0.40). 
Our results argue against applying CoT blindly to small instruct code models: its effect depends on architecture, training regime, and scale in ways that are mechanistically detectable from early-layer activations.
- Anthology ID: 2026.acl-srw.13
- Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 152–162
- URL: https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.13/
- Cite (ACL): Rajarshi Ghoshal, Debadri Basak, Salma Emad Mahmoud Abdelhalim, and Pratibha Kaur Arora. 2026. Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 152–162, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It (Ghoshal et al., ACL 2026)
- PDF: https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.13.pdf
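
The abstract reports differences in percentage points (pp) and an effect size written as h. A minimal sketch of how such numbers relate, assuming h denotes Cohen's h for two proportions (the abstract does not name the statistic); the function name `cohens_h` and the pass rates below are hypothetical and are not taken from the paper:

```python
import math


def cohens_h(p_new: float, p_base: float) -> float:
    """Cohen's h between two proportions, e.g. two Pass@1 rates.

    h = 2*asin(sqrt(p_new)) - 2*asin(sqrt(p_base)); positive when p_new > p_base.
    """
    for p in (p_new, p_base):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p_new)) - 2.0 * math.asin(math.sqrt(p_base))


# Hypothetical pass rates for illustration only (NOT results from the paper).
routed_pass_rate = 0.50
cot_pass_rate = 0.30

delta_pp = 100.0 * (routed_pass_rate - cot_pass_rate)
print(f"delta = {delta_pp:+.1f} pp, h = {cohens_h(routed_pass_rate, cot_pass_rate):+.2f}")
```

Unlike a raw percentage-point gap, h applies an arcsine transform, so the same gap counts for more when the rates sit near 0 or 1 than when they sit near 0.5.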