GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference

Chao Zeng; Songwei Liu; Shu Yang; Fangmin Chen; Xing Mei; Lean Fu

GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference

Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu

Abstract

Model compression has emerged as a mainstream solution to reduce memory usage and computational overhead. This paper proposes GQSA, a novel model compression framework specifically designed for LLMs. Traditional methods typically focus exclusively on either quantization or sparsification, but relying on a single strategy often results in significant performance loss at high compression rates. In contrast, GQSA integrates quantization and sparsification in a tightly coupled manner, leveraging GPU-friendly structured group sparsity and quantization for efficient acceleration. Building upon system-algorithm co-design principles, we propose a two-stage sparse optimization strategy that ensures the performance superiority of the compressed model. On the engine side, we introduce a “task-centric” parallel strategy, which, to the best of our knowledge, is the first application in the domain of sparse computing. Compared to the traditional 2:4 sparse method, the GQSA offers a more flexible and adjustable sparsity rate, as well as a higher weight compression rate, and is efficiently compatible with weight-only quantization methods. Experimental results demonstrate that, under the GQSA W4S50% compression setting, the model’s accuracy surpasses that of both 2:4 pruning and W2 quantization. Furthermore, at the inference level, GQSA outperforms W2 by 1.26 × and 2:4 pruning by 2.35 × in terms of speed.

Anthology ID:: 2025.ijcnlp-long.10
Volume:: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:: December
Year:: 2025
Address:: Mumbai, India
Editors:: Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:: IJCNLP | AACL
SIG:
Publisher:: The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Note:
Pages:: 149–165
Language:
URL:: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.10/
DOI:
Bibkey:
Cite (ACL):: Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, and Lean Fu. 2025. GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 149–165, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):: GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference (Zeng et al., IJCNLP-AACL 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.10.pdf

PDF Cite Search Fix data