LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
Yike Zhang | Zhiyuan He | Huiqiang Jiang | Chengruidong Zhang | Yuqing Yang | Jianyong Wang | Lili Qiu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%–18% V cache memory reduction, and a 1.45× decoding speedup. By analyzing the learned importance distribution, we also provide insights into model channels and attention heads during long-context inference. Our code is anonymously available at https://anonymous.4open.science/r/LeanK-7A87/README.md.
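To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of static K-cache channel pruning: learned per-channel importance scores fix a set of kept channel indices offline, the stored K cache drops the remaining channels, and queries are sliced to the same channels before the attention dot product. The function names, the 30% keep ratio, and the mask shared across heads are all illustrative assumptions.

```python
import torch

def prune_k_cache(k_cache: torch.Tensor, keep_idx: torch.Tensor) -> torch.Tensor:
    """Drop pruned channels from the stored K cache.

    k_cache:  [batch, heads, seq_len, head_dim]
    keep_idx: indices of channels to keep (assumed shared across heads here).
    Because the mask is static, pruned channels never need to be stored.
    """
    return k_cache.index_select(-1, keep_idx)

def attention_scores(q: torch.Tensor, k_pruned: torch.Tensor,
                     keep_idx: torch.Tensor) -> torch.Tensor:
    """Slice queries to the same kept channels before the dot product."""
    q_pruned = q.index_select(-1, keep_idx)
    return q_pruned @ k_pruned.transpose(-2, -1)

# Toy usage: keep the top 30% of channels by a stand-in importance score,
# shrinking the stored K cache by ~70%.
B, H, T, D = 1, 8, 1024, 128
k = torch.randn(B, H, T, D)
q = torch.randn(B, H, 1, D)
importance = torch.rand(D)  # stand-in for learned per-channel scores
keep_idx = importance.topk(int(0.3 * D)).indices.sort().values
k_small = prune_k_cache(k, keep_idx)             # [1, 8, 1024, 38]
scores = attention_scores(q, k_small, keep_idx)  # [1, 8, 1, 1024]
```

Since the kept indices are fixed after training, the pruned channels can be dropped at cache-write time rather than masked at read time, which is what yields the memory and decoding-speed savings.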