MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization
Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, Qingan Li
Abstract
Small language models (SLMs) are gaining attention for their lower computational and memory needs while maintaining strong performance. However, efficiently deploying SLMs on resource-constrained devices remains a significant challenge. Post-training quantization (PTQ) is a widely used compression technique that reduces memory usage and inference computation, yet existing methods suffer from inefficient bit-width allocation and insufficient fine-grained quantization adjustment, leading to suboptimal performance, particularly at lower bit-widths. To address these challenges, we propose multi-level weight quantization (MLWQ), which facilitates the efficient deployment of SLMs. Our method enables more effective bit-width allocation by jointly considering inter-layer loss and intra-layer salience. Furthermore, we propose a fine-grained partitioning of intra-layer salience to support the tuning of quantization parameters within each group. Experimental results indicate that MLWQ achieves competitive performance compared to state-of-the-art methods, providing an effective approach for the efficient deployment of SLMs while maintaining model accuracy.
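The abstract only sketches the approach, so the snippet below is a minimal, illustrative sketch of what salience-guided, group-wise weight quantization can look like in general; it is not the authors' MLWQ implementation. The activation-based salience proxy, the group size of 128, the 2/4-bit budget, and the median-based split are all assumptions made for illustration. In the paper, bit-widths are allocated jointly from inter-layer loss and intra-layer salience; the median split here is only a stand-in for that allocation step.

```python
# Illustrative sketch (not the authors' implementation): salience-guided,
# group-wise uniform quantization of one weight matrix.
# Assumptions: salience is approximated by the mean squared calibration
# activation per input channel; group size and candidate bit-widths are
# hypothetical choices, not values taken from the paper.
import numpy as np

def quantize_group(w, bits):
    """Asymmetric uniform quantization of a 1-D weight group."""
    qmax = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized weights

def salience_guided_quantize(W, X, group_size=128, bit_budget=(2, 4)):
    """Quantize W (out_features x in_features) group-wise along the input dim.

    Groups whose input channels carry higher salience receive the larger
    bit-width from `bit_budget`; the remaining groups receive the smaller one.
    """
    low_bits, high_bits = bit_budget
    salience = (X ** 2).mean(axis=0)  # per-input-channel salience proxy
    n_groups = W.shape[1] // group_size
    group_sal = salience[: n_groups * group_size].reshape(n_groups, group_size).mean(axis=1)
    threshold = np.median(group_sal)  # illustrative split of the bit budget
    W_q = W.copy()
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        bits = high_bits if group_sal[g] >= threshold else low_bits
        for r in range(W.shape[0]):
            W_q[r, cols] = quantize_group(W[r, cols], bits)
    return W_q

# Toy usage: a random layer and random calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)
X = rng.normal(size=(64, 512)).astype(np.float32)
W_q = salience_guided_quantize(W, X)
print("mean abs quantization error:", np.abs(W - W_q).mean())
```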
- Anthology ID: 2025.emnlp-main.408
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 8078–8088
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.408/
- Cite (ACL): Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, and Qingan Li. 2025. MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization (Hu et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.408.pdf