SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models

Huanran Zheng, Wei Zhu, Xiaoling Wang


Abstract
Large language models (LLMs) have achieved impressive performance across various domains, but their limited context window and the high computational cost of processing long texts restrict their broader application. In this paper, we propose Selective Compression Attention (SCA), a general and effective method to expand the context window and reduce the memory footprint by compressing the KV cache of LLMs. Specifically, through preliminary experiments, we found that the KV cache contains many similar vectors, resulting in information redundancy that can be compressed by retaining representative vectors and discarding the rest. SCA therefore uses a greedy algorithm to continuously select the most distinctive vectors to keep, reducing information loss during compression. Extensive experiments on various tasks verify the effectiveness of our method. Compared with existing methods, SCA significantly reduces the impact on model performance at the same compression ratio. Furthermore, SCA can efficiently expand the context window of LLMs without any training, even achieving better performance than specially fine-tuned long-context models.
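To make the compression idea concrete, the following is a minimal sketch (in PyTorch) of a greedy, distinctiveness-based selection over cached key/value vectors. The cosine-similarity criterion, the seeding choice, and the fixed budget are illustrative assumptions for this sketch; the paper's actual selection criterion and algorithmic details are described in the full text.

```python
import torch


def greedy_select_kv(keys: torch.Tensor, values: torch.Tensor, budget: int):
    """Greedily keep the `budget` most distinctive key vectors (and their values).

    Here a vector counts as "distinctive" if it has low cosine similarity to the
    vectors already kept (a farthest-point-style heuristic). This only
    illustrates the general idea of selective KV-cache compression.

    keys, values: [seq_len, head_dim] tensors for one attention head.
    """
    n = keys.size(0)
    if budget >= n:
        return keys, values

    normed = torch.nn.functional.normalize(keys, dim=-1)
    selected = [0]                    # seed with the first token's key (arbitrary choice)
    min_sim = normed @ normed[0]      # each candidate's max-similarity proxy to the kept set

    for _ in range(budget - 1):
        min_sim[selected] = float("inf")        # never re-pick an already kept vector
        next_idx = int(torch.argmin(min_sim))   # most dissimilar to the kept set
        selected.append(next_idx)
        # update similarity-to-kept-set with the newly added vector
        min_sim = torch.minimum(min_sim, normed @ normed[next_idx])

    idx = torch.tensor(sorted(selected))        # preserve original token order
    return keys[idx], values[idx]
```

In use, such a routine would be applied per head whenever the cache exceeds its budget, so attention is subsequently computed only over the retained key/value pairs.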
Anthology ID:
2024.findings-emnlp.358
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6166–6178
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.358/
DOI:
10.18653/v1/2024.findings-emnlp.358
Cite (ACL):
Huanran Zheng, Wei Zhu, and Xiaoling Wang. 2024. SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6166–6178, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models (Zheng et al., Findings 2024)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.358.pdf