Shivdeep Singh
2026
Think Like You Execute: Verifiable Chain of Thought from Program Traces
Shailja Thakur | Vaibhav Saxena | Rohan Kulkarni | Shivdeep Singh | Parameswaran Selvam | Hiroshi Kanayama | Hima Patel
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Teaching language models to reason about code execution is still an open problem. Current synthetic Chain-of-Thought (CoT) training data often consists of plausible-sounding explanations generated by teacher models rather than verifiable accounts of actual program behavior. As a result, models learn reasoning patterns that are logically flawed even when the accompanying code is syntactically correct. We address this by grounding CoT generation directly in program execution traces. Our pipeline instruments code to capture dynamic behavior, narrates the execution traces in natural language, and actively verifies each rationale against the trace. We systematically create 54,000 execution-verified, bi-directional rationales that teach models to reason both forward (input→output) and backward (output→input). Models fine-tuned on our verified data improve by +24.2 on LiveCodeBench-Exec, +22.3 on CruxEval-Output, and +21.1 on CruxEval-Input, demonstrating that verification quality directly determines both reasoning and code generation capabilities.
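The abstract's three stages (instrument, narrate, verify) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `trace_execution`, `narrate`, and `verify_claims`, the use of `sys.settrace`, and the claim format (variable name and value pairs plus a claimed output) are all assumptions made for illustration.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args), recording (line offset, locals snapshot) before each executed line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames belonging to other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return steps, result


def narrate(steps, result):
    """Turn the raw trace into a plain-language account of the run."""
    lines = [
        f"Step {i}: before line offset {offset}, the local variables are {snapshot}."
        for i, (offset, snapshot) in enumerate(steps, 1)
    ]
    lines.append(f"The function returns {result!r}.")
    return "\n".join(lines)


def verify_claims(claims, claimed_output, steps, result):
    """Accept a rationale only if every (name, value) claim was observed in the
    trace and the claimed output equals the real one (hypothetical claim format)."""
    observed = {(name, repr(value)) for _, snap in steps for name, value in snap.items()}
    claims_hold = all((name, repr(value)) in observed for name, value in claims)
    return claims_hold and claimed_output == result


def running_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total


if __name__ == "__main__":
    steps, result = trace_execution(running_sum, [1, 2, 3])
    print(narrate(steps, result))
    # A rationale claiming total reaches 3 and the output is 6 matches the trace.
    print(verify_claims([("total", 3)], 6, steps, result))
    # A rationale claiming total reaches 5 is rejected.
    print(verify_claims([("total", 5)], 6, steps, result))
```

In this sketch a forward rationale is checked by comparing its claimed output with the traced result. A backward (output→input) rationale could be checked the same way, by executing the proposed input and confirming that it reproduces the target output.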