RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao

Abstract

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.
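The guess-and-verify strategy mentioned above can be illustrated with a toy sketch. This is not RACER's implementation: it shows only generic greedy speculative decoding with a simple retrieval-style draft, and every name in it (`retrieve_draft`, `verify`, `speculative_generate`, the `target_next` callback) is a hypothetical chosen for illustration. For readability the verifier queries the target once per draft token, whereas a real system would score all draft positions in a single forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical stand-in for the target LLM: maps a context to its greedy next token.
NextTokenFn = Callable[[Sequence[Token]], Token]


def retrieve_draft(context: Sequence[Token], n: int = 2, max_len: int = 4) -> List[Token]:
    """Toy retrieval draft: find the most recent earlier occurrence of the
    last n tokens and propose the tokens that followed it."""
    if len(context) <= n:
        return []
    suffix = list(context[-n:])
    # Scan right to left, starting just before the suffix itself.
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == suffix:
            return list(context[start + n:start + n + max_len])
    return []  # no exact match: this step falls back to plain decoding


def verify(context: Sequence[Token], draft: Sequence[Token],
           target_next: NextTokenFn) -> List[Token]:
    """Greedy verification: keep the longest draft prefix the target agrees
    with, then append one token chosen by the target itself."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Whole draft accepted: the target still contributes one extra token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


def speculative_generate(prompt: Sequence[Token], target_next: NextTokenFn,
                         max_new_tokens: int) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(verify(out, retrieve_draft(out), target_next))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps
```

Because every accepted token is checked against the target's own greedy choice, the output matches plain autoregressive decoding; any speedup comes from accepting several tokens per verification step. When `retrieve_draft` finds no exact match, each step yields a single token, which is the failure mode of retrieval-only drafting that the abstract describes.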

Anthology ID: 2026.findings-acl.998
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19962–19988
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.998/
Cite (ACL): Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, and Hai Zhao. 2026. RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19962–19988, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding (Zhang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.998.pdf
Checklist: 2026.findings-acl.998.checklist.pdf
