Qirui Chen

2026

Retrieval-Augmented Generation (RAG) significantly enhances LLMs but incurs high prefill latency when processing long contexts. KV cache reuse can mitigate this, but current methods rely on shallow features or static heuristics and often fail to identify the tokens that most need recomputation, which degrades generation quality. We observe that KV deviations are more pronounced in deep layers. However, directly extracting deep-layer features from the target model is computationally prohibitive. Crucially, we find that the deep-layer features of a lightweight speculative model exhibit strong consistency with those of the target model in the selection of critical tokens for recomputation. In light of these insights, we propose SpecCache, which uses deep-layer hidden-state norms from a speculative model as a proxy to guide critical token selection for the large target model. Experiments demonstrate that SpecCache outperforms state-of-the-art (SOTA) baselines. Compared to full KV recomputation, it reduces time-to-first-token (TTFT) by 2.17-3.95× and increases inference throughput by 2.7-5.2×, with negligible degradation in generation quality.
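The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the deep-layer hidden states of the speculative model are already available as one vector per token, that tokens with the largest L2 norm are the ones chosen for recomputation, and that a fixed fraction of tokens is recomputed. The function name `select_critical_tokens` and the parameter `recompute_ratio` are hypothetical; the abstract specifies neither the ranking direction nor the selection budget.

```python
import math


def select_critical_tokens(hidden_states, recompute_ratio=0.25):
    """Pick token positions whose KV entries would be recomputed.

    hidden_states: list of per-token vectors taken from a deep layer of the
        small speculative model (assumed to be given; extraction not shown).
    recompute_ratio: hypothetical budget, the fraction of tokens to recompute.
    """
    if not hidden_states:
        return []

    # L2 norm of each token's hidden state serves as the proxy score.
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in hidden_states]

    # Always recompute at least one token.
    k = max(1, round(len(norms) * recompute_ratio))

    # Assumption: larger norm means more critical. Rank descending, keep top k.
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)

    # Return positions in sequence order so they can index the KV cache.
    return sorted(ranked[:k])


# Toy input: four tokens with two-dimensional hidden states.
toy_states = [[0.1, 0.2], [3.0, 4.0], [0.5, 0.5], [1.0, 0.0]]
selected = select_critical_tokens(toy_states, recompute_ratio=0.5)
```

In this sketch the target model would recompute KV entries only at the positions in `selected` and reuse the cached entries elsewhere. How the selected positions are mapped between the speculative and target tokenizations is not covered here.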

Co-authors

Jingxian Shuai 1

Zijian Wen 1

Shenghao Ye 1

Tao Zhang 1

Venues

ACL 1
