Pratibha Kaur Arora

2026

Think Less, Code Better: Probing When Chain-of-Thought Hurts and How to Route Around It
Rajarshi Ghoshal | Debadri Basak | Salma Emad Mahmoud Abdelhalim | Pratibha Kaur Arora
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Chain-of-Thought (CoT) prompting is the dominant strategy for eliciting step-by-step reasoning in large language models, but its effect on code generation is poorly understood. We present a controlled 2×2 study of Qwen2.5-Coder-1.5B and DeepSeek-Coder-1.3B (each in base and instruction-tuned variants) on HumanEval, MBPP, and LiveCodeBench, plus scale-validation runs on Qwen2.5-Coder at 7B and 14B and a preliminary evaluation of CodeLlama-7B. We find that instruction tuning reverses CoT’s effect on small Qwen models: CoT improves the 1.5B base (+13.4 pp, p<0.001) but significantly degrades the 1.5B instruct variant (-15.2 pp, p<0.001). The reversal is sharply scale-bounded: it disappears at 7B (-0.6 pp) and goes slightly positive at 14B (+2.4 pp), while CoT’s positive effect on base models grows monotonically with scale (+13.4 → +28.7 pp). DeepSeek-Coder-1.3B is insensitive regardless of regime. A direct token-count and truncation analysis shows the mechanism: at 1.5B, CoT inflates Qwen Instruct’s mean output length by 112 tokens and pushes 7.6× more generations into truncation, where Pass@1 is 0%; at 14B, the same prefix produces complete code well within budget. Layer-wise probing shows all four small models encode prompt type by layers 1–4 (>90% accuracy), universally and regardless of whether CoT helps or hurts, demonstrating that representation does not determine interpretation: the same internal signal drives divergent downstream behavior depending on training regime and capacity. Building on these mechanistic findings, we develop a probe-guided style router that, when trained per model on a labeled training split, selects among 12 prompt styles via a single 84 ms forward pass; it is statistically indistinguishable from the best fixed style in 7/8 settings and significantly outperforms CoT where CoT is most harmful (p=0.012, h=+0.40).
Our results argue against applying CoT blindly to small instruct code models: its effect depends on architecture, training regime, and scale in ways that are mechanistically detectable from early-layer activations.
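The abstract reports an effect size as h=+0.40 without naming the statistic. Assuming this denotes Cohen's h (the standard effect size for a difference between two proportions, such as two Pass@1 rates), a minimal sketch of the computation follows. The function name and the example pass rates are hypothetical and are not taken from the paper.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of two proportions after the arcsine transform.

    Positive values mean p1 is larger than p2.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


# Hypothetical Pass@1 rates, chosen only for illustration (not from the paper).
router_pass_rate = 0.55
cot_pass_rate = 0.35
print(f"h = {cohens_h(router_pass_rate, cot_pass_rate):+.2f}")
```

By Cohen's conventional thresholds, |h| near 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large, so a value of 0.40 would sit between small and medium under this reading.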
