Abstract
Large language models (LLMs) have achieved impressive performance across various domains, but their limited context window and the high computational cost of processing long texts restrict their broader application. In this paper, we propose Selective Compression Attention (SCA), a general and effective method for expanding the context window and reducing the memory footprint by compressing the KV cache of LLMs. In preliminary experiments, we found that the KV cache contains many similar vectors and is therefore informationally redundant: it can be compressed by retaining representative vectors and discarding the rest. SCA therefore uses a greedy algorithm to continually select the most distinctive vectors to keep, reducing information loss during compression. Extensive experiments on a variety of tasks verify the effectiveness of our method. Compared with existing methods, SCA significantly reduces the impact on model performance at the same compression ratio. Furthermore, SCA can efficiently extend the context window of LLMs without any training, even outperforming models specifically fine-tuned for long contexts.
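The abstract's core mechanism is a greedy selection of the most distinctive KV-cache vectors. As a rough illustration only, not the authors' implementation, the sketch below assumes cosine similarity as the redundancy measure and a farthest-point-style greedy order over a single attention head's cached keys; the function name, shapes, and the choice to always keep the most recent token are all hypothetical.

```python
import torch
import torch.nn.functional as F

def greedy_select_kv(keys: torch.Tensor, values: torch.Tensor, keep: int):
    """Greedily keep the `keep` most distinctive cached key vectors (and their values).

    keys, values: [seq_len, head_dim] tensors for one attention head.
    Returns compressed (keys, values) with `keep` entries, original order preserved.
    """
    seq_len = keys.shape[0]
    if keep >= seq_len:
        return keys, values

    normed = F.normalize(keys, dim=-1)      # unit vectors, so dot product = cosine similarity
    first = seq_len - 1                     # assumption: always keep the most recent token
    selected = [first]
    max_sim = normed @ normed[first]        # similarity of every vector to the kept set
    max_sim[first] = float("inf")           # mark kept indices so they are never re-picked

    for _ in range(keep - 1):
        idx = int(torch.argmin(max_sim))    # least similar to the kept set = most distinctive
        selected.append(idx)
        max_sim = torch.maximum(max_sim, normed @ normed[idx])
        max_sim[idx] = float("inf")

    keep_idx = torch.tensor(sorted(selected), device=keys.device)
    return keys[keep_idx], values[keep_idx]
```

For instance, calling `greedy_select_kv(k, v, keep=256)` on a 1024-token cache would retain a quarter of the entries, dropping near-duplicate vectors first; the paper's exact scoring and compression schedule may differ.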
- Anthology ID: 2024.findings-emnlp.358
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 6166–6178
- URL: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.358/
- DOI: 10.18653/v1/2024.findings-emnlp.358
- Cite (ACL): Huanran Zheng, Wei Zhu, and Xiaoling Wang. 2024. SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6166–6178, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models (Zheng et al., Findings 2024)
- PDF: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.358.pdf