Qirui Chen
2026
SpecCache: Speculative KV Cache Reuse for Efficient RAG Serving
Zijian Wen | Tao Zhang | Shuangwu Chen | Shenghao Ye | Yu Guo | Qirui Chen | Jingxian Shuai | Yunpeng Hou | Huasen He | Jianyang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) significantly enhances LLMs but suffers from high prefill latency when processing long contexts. KV cache reuse can mitigate this, but current methods rely on shallow features or static heuristics and often fail to identify the critical tokens that need recomputation, which degrades generation quality. We observe that KV deviations are more pronounced in deep layers. However, directly extracting deep-layer features from the target model is computationally prohibitive. Crucially, we find that the deep-layer features of a lightweight speculative model are strongly consistent with those of the target model in selecting critical tokens for recomputation. In light of these insights, we propose SpecCache, which uses deep-layer hidden-state norms from a speculative model as a proxy to guide critical token selection for the target large model. Experiments demonstrate that SpecCache outperforms state-of-the-art (SOTA) baselines. Compared to full KV recomputation, it reduces time-to-first-token (TTFT) by 2.17-3.95× and increases inference throughput by 2.7-5.2×, with negligible degradation in generation quality.
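The selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that deep-layer hidden-state norms from a speculative model serve as a proxy for choosing tokens to recompute. The sketch assumes that tokens with larger L2 norms are the more critical ones and that a fixed fraction of tokens is recomputed. The function names, the `recompute_ratio` parameter, and the toy hidden states are all hypothetical.

```python
import math


def token_norms(hidden_states):
    """L2 norm of each token's hidden-state vector (one vector per token)."""
    return [math.sqrt(sum(x * x for x in vec)) for vec in hidden_states]


def select_critical_tokens(hidden_states, recompute_ratio=0.25):
    """Return sorted indices of tokens flagged for KV recomputation.

    Assumption (not from the paper): tokens are ranked by norm, largest
    first, and a fixed fraction `recompute_ratio` of them is selected.
    """
    norms = token_norms(hidden_states)
    if not norms:
        return []
    k = max(1, math.ceil(recompute_ratio * len(norms)))
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])


# Toy stand-in for deep-layer hidden states of a small speculative model:
# four tokens, two hidden dimensions each.
toy_hidden = [[0.1, 0.2], [3.0, 4.0], [0.0, 0.5], [1.0, 1.0]]
critical = select_critical_tokens(toy_hidden, recompute_ratio=0.5)
print(critical)  # indices whose KV entries the target model would recompute
```

In such a scheme the target model would recompute KV entries only at the returned positions and reuse cached entries everywhere else, which is where the prefill savings would come from.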