Cacheback: Speculative Decoding With Nothing But Cache

Zhiyao Ma, In Gim, Lin Zhong


Abstract
We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.
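To make the abstract's idea concrete, here is a minimal sketch of drafting from an LRU n-gram table. The class name, parameters, and the choice to map each n-gram prefix to a single continuation token are illustrative assumptions, not the paper's implementation; the paper should be consulted for the actual table design and verification loop.

```python
from collections import OrderedDict

class NGramLRUCache:
    """Hypothetical LRU table mapping an n-gram prefix to the token that
    most recently followed it, used to propose cheap draft tokens."""

    def __init__(self, n: int = 3, capacity: int = 4096):
        self.n = n                   # prefix length (n-gram order)
        self.capacity = capacity     # max entries before LRU eviction
        self.table = OrderedDict()   # prefix tuple -> next token

    def update(self, tokens: list[int]) -> None:
        """Record every (prefix, next-token) pair observed in `tokens`."""
        for i in range(len(tokens) - self.n):
            prefix = tuple(tokens[i : i + self.n])
            self.table[prefix] = tokens[i + self.n]
            self.table.move_to_end(prefix)       # mark as recently used
            if len(self.table) > self.capacity:
                self.table.popitem(last=False)   # evict least recently used

    def draft(self, context: list[int], max_draft: int = 4) -> list[int]:
        """Greedily chain cache hits to build a draft sequence."""
        draft, ctx = [], list(context)
        for _ in range(max_draft):
            hit = self.table.get(tuple(ctx[-self.n:]))
            if hit is None:
                break                # cache miss: stop drafting
            draft.append(hit)
            ctx.append(hit)
        return draft

# Drafts cost only dictionary lookups; in speculative decoding, the
# target LLM would verify the drafted tokens in a single parallel pass.
cache = NGramLRUCache(n=2)
cache.update([1, 2, 3, 4, 2, 3, 4, 5])
print(cache.draft([1, 2]))  # -> [3, 4, 5]
```

Because the table is training-free and updated from recently decoded tokens, this style of drafter can be dropped in front of any target model, which is consistent with the model-agnostic claim in the abstract.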
Anthology ID:
2025.emnlp-main.1581
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
31067–31072
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1581/
Cite (ACL):
Zhiyao Ma, In Gim, and Lin Zhong. 2025. Cacheback: Speculative Decoding With Nothing But Cache. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 31067–31072, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Cacheback: Speculative Decoding With Nothing But Cache (Ma et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1581.pdf
Checklist:
 2025.emnlp-main.1581.checklist.pdf