Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song; Saket Dingliwal; Sai Muralidhar Jayanthi; Bhavana Ganesh; Jinwoo Shin; Aram Galstyan; Sravan Babu Bodapati

Abstract

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
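The abstract names Gumbel-Top-K sampling as one ingredient of stochastic drafting but does not say how STAND implements or optimizes it. As a rough illustration only, the following is a minimal sketch of the standard Gumbel-Top-K trick: perturb each logit with independent Gumbel noise and keep the k largest, which yields k distinct candidates sampled without replacement in proportion to the softmax of the logits. The function name `gumbel_top_k`, its parameters, and the clamping constant are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def gumbel_top_k(logits, k, rng=random):
    """Return k distinct indices sampled without replacement.

    Illustrative sketch of the generic Gumbel-Top-K trick, not the
    authors' optimized variant. `rng` is any object exposing a
    `random()` method that returns a float in [0, 1).
    """
    perturbed = []
    for idx, logit in enumerate(logits):
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel, idx))
    # Keep the k candidates with the largest perturbed scores.
    perturbed.sort(reverse=True)
    return [idx for _, idx in perturbed[:k]]


if __name__ == "__main__":
    # Hypothetical next-token logits, e.g. read from an N-gram table.
    draft_logits = [2.0, 1.0, 0.5, -1.0]
    print(gumbel_top_k(draft_logits, k=2, rng=random.Random(0)))
```

In a drafting setting, the returned indices could serve as the sibling candidates at one level of a draft tree; how STAND actually builds its tree is described in the paper, not here.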

Anthology ID: 2025.emnlp-main.1558
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 30611–30624
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1558/
Cite (ACL): Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, and Sravan Babu Bodapati. 2025. Accelerated Test-Time Scaling with Model-Free Speculative Sampling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30611–30624, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Accelerated Test-Time Scaling with Model-Free Speculative Sampling (Song et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1558.pdf
Checklist: 2025.emnlp-main.1558.checklist.pdf