@inproceedings{le-etal-2025-spectra,
title = "{SPECTRA}: Faster Large Language Model Inference with Optimized Internal and External Speculation",
author = "Le, Nguyen-Khang and
Do, Truong Dinh and
Nguyen, Le-Minh",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.685/",
pages = "14015--14034",
ISBN = "979-8-89176-251-0",
abstract = "Inference with modern Large Language Models (LLMs) is both computationally expensive and time-consuming. Speculative decoding has emerged as a promising solution, but existing approaches face key limitations: training-based methods require a draft model that is challenging to obtain and lacks generalizability, while training-free methods offer limited speedup gains. In this work, we present Spectra, a novel framework for accelerating LLM inference without the need for additional training or modification to the original LLM. Spectra introduces two new techniques for efficiently utilizing internal and external speculation, each outperforming corresponding state-of-the-art (SOTA) methods independently. When combined, these techniques achieve up to a 4.08x speedup across various benchmarks and LLM architectures, significantly surpassing existing training-free approaches. The implementation of Spectra is publicly available."
}