MM-ShiftKV: Decode-Aware Prefill-Stage KV Selection for Multimodal Large Language Models

Jinsong Shu, Chenyang Wu, Zhongle Xie, Baokun Wang, Lidan Shou


Abstract
Key-Value (KV) caching is essential for efficient inference in multimodal large language models (MLLMs), yet its memory footprint grows linearly with context length and becomes a major bottleneck due to the large number of visual tokens. Recent prefill-stage KV selection methods estimate KV importance from prefilling statistics, implicitly assuming that prefilling-time queries are representative of those encountered during decoding. We show that this assumption breaks down in multimodal inference, where decoding-time queries exhibit substantially larger variance than prefilling-stage representations, leading to unstable KV importance estimation under tight cache budgets. As a result, small ranking errors can disproportionately discard semantically critical visual tokens and degrade grounding and reasoning performance. We propose MM-ShiftKV, a training-free, decode-aware and strictly prefill-only KV selection method. MM-ShiftKV approximates decoding-time query behavior during prefilling by constructing variance-expanded query proxies and estimates prompt KV importance based on their aggregated attention mass. Experiments on multimodal benchmarks demonstrate that MM-ShiftKV consistently outperforms existing methods under strict KV-cache budgets.
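The abstract names the selection idea (variance-expanded query proxies scored by aggregated attention mass) but gives no formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a proxy is formed by scaling each prefill query's deviation from the mean query by a factor `alpha`, that importance is the sum of softmax attention weights over all proxies, and that the top `budget` prompt positions are kept. The names `expand_queries`, `select_kv`, `alpha`, and `budget` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def expand_queries(queries, alpha):
    """Assumed proxy rule: scale each query's deviation from the mean by alpha.

    With alpha > 1 the proxies keep the same mean as the prefill queries
    but have alpha**2 times their per-dimension variance.
    """
    dim = len(queries[0])
    mean = [sum(q[d] for q in queries) / len(queries) for d in range(dim)]
    return [[mean[d] + alpha * (q[d] - mean[d]) for d in range(dim)]
            for q in queries]


def select_kv(queries, keys, budget, alpha=2.0):
    """Rank prompt keys by attention mass aggregated over the proxies.

    Returns the kept key indices (in prompt order) and the per-key mass.
    """
    proxies = expand_queries(queries, alpha)
    scale = 1.0 / math.sqrt(len(keys[0]))
    mass = [0.0] * len(keys)
    for p in proxies:
        logits = [scale * sum(pi * ki for pi, ki in zip(p, k)) for k in keys]
        for j, w in enumerate(softmax(logits)):
            mass[j] += w  # aggregate attention each key receives
    ranked = sorted(range(len(keys)), key=lambda j: mass[j], reverse=True)
    return sorted(ranked[:budget]), mass


# Toy example: two prefill queries, three prompt keys, cache budget of two.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
kept, mass = select_kv(queries, keys, budget=2)
print(kept)
```

In this toy setting the third key points away from both proxies, receives the least aggregated mass, and is the one dropped. Everything runs on prompt-side quantities only, which is what a prefill-only method requires; how the paper actually builds and weights its proxies may differ.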
Anthology ID:
2026.findings-acl.1447
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
28964–28982
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1447/
Cite (ACL):
Jinsong Shu, Chenyang Wu, Zhongle Xie, Baokun Wang, and Lidan Shou. 2026. MM-ShiftKV: Decode-Aware Prefill-Stage KV Selection for Multimodal Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28964–28982, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MM-ShiftKV: Decode-Aware Prefill-Stage KV Selection for Multimodal Large Language Models (Shu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1447.pdf
Checklist:
 2026.findings-acl.1447.checklist.pdf