S*: Test Time Scaling for Code Generation

Dacheng Li; Shiyi Cao; Chengkun Cao; Xiuyu Li; Shangyin Tan; Kurt Keutzer; Jiarong Xing; Joseph E. Gonzalez; Ion Stoica

doi:10.18653/v1/2025.findings-emnlp.865

Abstract

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* augments the existing parallel scaling approach with sequential scaling to further improve performance. It also leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information, to robustly identify correct solutions. We evaluate S* across 12 Large Language Models and Large Reasoning Models and show that: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code, model generations, and intermediate experimental results are available at https://github.com/NovaSky-AI/SkyThought.

Anthology ID: 2025.findings-emnlp.865
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15964–15978
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.865/
DOI: 10.18653/v1/2025.findings-emnlp.865
Cite (ACL): Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. 2025. S*: Test Time Scaling for Code Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15964–15978, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): S*: Test Time Scaling for Code Generation (Li et al., Findings 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.865.pdf
Checklist: 2025.findings-emnlp.865.checklist.pdf