SpecCache: Speculative KV Cache Reuse for Efficient RAG Serving

Zijian Wen, Tao Zhang, Shuangwu Chen, Shenghao Ye, Yu Guo, Qirui Chen, Jingxian Shuai, Yunpeng Hou, Huasen He, Jianyang


Abstract
Retrieval-Augmented Generation (RAG) significantly enhances LLMs but suffers from high prefill latency when processing long contexts. KV cache reuse can mitigate this, but current methods rely on shallow features or static heuristics and often fail to identify the critical tokens that need recomputation, which degrades generation quality. We observe that KV deviations are more pronounced in deep layers. However, directly extracting deep-layer features from the target model is computationally prohibitive. Crucially, we find that the deep-layer features of a lightweight speculative model are strongly consistent with those of the target model in selecting critical tokens for recomputation. In light of these insights, we propose SpecCache, which uses deep-layer hidden-state norms from a speculative model as a proxy to guide critical token selection for the target large model. Experiments demonstrate that SpecCache outperforms state-of-the-art (SOTA) baselines. Compared to full KV recomputation, it reduces time-to-first-token (TTFT) by 2.17-3.95× and increases inference throughput by 2.7-5.2×, with negligible degradation in generation quality.
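The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, the layer used, or the selection budget, so everything concrete here is an assumption made for illustration: the function name `select_tokens_to_recompute`, the `recompute_ratio` parameter, the use of the L2 norm, and the rule that tokens with the largest norms are the ones recomputed. The input stands in for per-token deep-layer hidden states taken from the speculative model.

```python
import math
from typing import List, Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean norm of one token's hidden-state vector."""
    return math.sqrt(sum(x * x for x in vec))


def select_tokens_to_recompute(
    draft_hidden: Sequence[Sequence[float]],
    recompute_ratio: float,
) -> List[int]:
    """Pick token positions whose KV entries would be recomputed.

    draft_hidden: one hidden-state vector per token, assumed to come from
        a deep layer of the lightweight speculative model.
    recompute_ratio: hypothetical budget, the fraction of tokens to recompute.

    Assumption: a larger norm marks a more critical token. The paper may
    use a different scoring or thresholding rule.
    """
    if not 0.0 <= recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must be in [0, 1]")
    n = len(draft_hidden)
    k = min(n, math.ceil(recompute_ratio * n))
    scores = [l2_norm(h) for h in draft_hidden]
    # Rank positions by descending norm, then return the top k in
    # original token order so they can be recomputed left to right.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy hidden states for four cached tokens (illustrative values only).
hidden = [[0.1, 0.2], [3.0, 4.0], [0.5, 0.5], [1.0, 2.0]]
critical = select_tokens_to_recompute(hidden, recompute_ratio=0.5)
```

In this sketch the remaining positions would keep their cached KV entries, and only the returned positions would be passed through the target model again.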
Anthology ID:
2026.acl-long.859
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
18861–18871
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.859/
Cite (ACL):
Zijian Wen, Tao Zhang, Shuangwu Chen, Shenghao Ye, Yu Guo, Qirui Chen, Jingxian Shuai, Yunpeng Hou, Huasen He, and Jianyang. 2026. SpecCache: Speculative KV Cache Reuse for Efficient RAG Serving. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18861–18871, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SpecCache: Speculative KV Cache Reuse for Efficient RAG Serving (Wen et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.859.pdf
Checklist:
 2026.acl-long.859.checklist.pdf