When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models
Weilan Wang, Yu Mao, Tang Dongdong, Du Hongchao, Nan Guan, Chun Jason Xue
Abstract
Large language models (LLMs) exhibit excellent performance across various tasks. However, their memory requirements pose a significant challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework that further compresses LLMs after quantization, achieving a compression ratio of about 2.2x. We first propose a compression-aware quantization that enhances the compressibility of model weights by re-scaling the model parameters before quantization, followed by a pruning method for further improvement. We then observe that decompression can become a bottleneck in practical scenarios, and give a detailed analysis of the trade-off between memory usage and latency introduced by the proposed method, proposing a speed-adaptive method to overcome it. Experimental results show that inference with the compressed model achieves a 40% reduction in memory size with negligible loss in accuracy and inference speed.
- Anthology ID: 2024.findings-emnlp.988
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 16973–16983
- URL: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.988/
- DOI: 10.18653/v1/2024.findings-emnlp.988
- Cite (ACL): Weilan Wang, Yu Mao, Tang Dongdong, Du Hongchao, Nan Guan, and Chun Jason Xue. 2024. When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16973–16983, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models (Wang et al., Findings 2024)
- PDF: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.988.pdf
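The abstract describes the core idea of double compression: quantize the weights, shape them so the quantized values compress well (e.g., via pruning), and then apply lossless compression that is undone at load or inference time. The Python sketch below is only a conceptual illustration of that general idea, not the paper's actual pipeline: the helper names, the per-row symmetric 4-bit scheme, the 30% magnitude-pruning threshold, the int8 container for 4-bit values, and the use of zlib as the lossless coder are all assumptions made here for demonstration.

```python
import zlib

import numpy as np


def quantize_per_row(weight: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-row quantization to `bits`-bit levels (stored in int8 for simplicity)."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # avoid division by zero on all-zero rows
    return np.clip(np.round(weight / scale), -qmax - 1, qmax).astype(np.int8)


def lossless_ratio(q: np.ndarray) -> float:
    """How much smaller the quantized tensor gets under generic lossless compression."""
    raw = q.tobytes()
    return len(raw) / len(zlib.compress(raw, 9))


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)  # stand-in weight matrix

q_plain = quantize_per_row(w, bits=4)

# Illustrative pruning: zero the smallest-magnitude 30% of weights before quantization,
# which concentrates the value distribution and helps the entropy coder.
threshold = np.quantile(np.abs(w), 0.30)
w_pruned = np.where(np.abs(w) < threshold, 0.0, w)
q_pruned = quantize_per_row(w_pruned, bits=4)

print(f"quantized only     : {lossless_ratio(q_plain):.2f}x lossless compression")
print(f"pruned + quantized : {lossless_ratio(q_pruned):.2f}x lossless compression")
```

Note that storing 4-bit values in int8 containers already leaves easy redundancy for the codec, so the printed ratios overstate what a real bit-packed format would see; in a deployed system the codec would also be chosen for decompression speed, which is exactly the memory-versus-latency trade-off the paper analyzes.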