MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization
Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, Qingan Li
Abstract
Small language models (SLMs) are gaining attention for their lower computational and memory needs while maintaining strong performance. However, efficiently deploying SLMs on resource-constrained devices remains a significant challenge. Post-training quantization (PTQ) is a widely used compression technique that reduces memory usage and inference computation, yet existing methods suffer from inefficient bit-width allocation and insufficient fine-grained quantization adjustment, leading to suboptimal performance, particularly at lower bit-widths. To address these challenges, we propose multi-level weight quantization (MLWQ), which facilitates the efficient deployment of SLMs. Our method enables more effective bit-width allocation by jointly considering inter-layer loss and intra-layer salience. Furthermore, we propose a fine-grained partitioning of intra-layer salience to support the tuning of quantization parameters within each group. Experimental results indicate that MLWQ achieves competitive performance compared to state-of-the-art methods, providing an effective approach for the efficient deployment of SLMs while maintaining model accuracy.
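The abstract only sketches the approach, so the snippet below is a minimal, illustrative sketch of what salience-guided, group-wise weight quantization can look like in general; it is not the authors' MLWQ implementation. The activation-based salience proxy, the group size of 128, the 2/4-bit budget, and the median-based split are all assumptions made for illustration. In the paper, bit-widths are allocated jointly from inter-layer loss and intra-layer salience; the median split here is only a stand-in for that allocation step.

```python
# Illustrative sketch (not the authors' implementation): salience-guided,
# group-wise uniform quantization of one weight matrix.
# Assumptions: salience is approximated by the mean squared calibration
# activation per input channel; group size and candidate bit-widths are
# hypothetical choices, not values taken from the paper.
import numpy as np

def quantize_group(w, bits):
    """Asymmetric uniform quantization of a 1-D weight group."""
    qmax = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized weights

def salience_guided_quantize(W, X, group_size=128, bit_budget=(2, 4)):
    """Quantize W (out_features x in_features) group-wise along the input dim.

    Groups whose input channels carry higher salience receive the larger
    bit-width from `bit_budget`; the remaining groups receive the smaller one.
    """
    low_bits, high_bits = bit_budget
    salience = (X ** 2).mean(axis=0)  # per-input-channel salience proxy
    n_groups = W.shape[1] // group_size
    group_sal = salience[: n_groups * group_size].reshape(n_groups, group_size).mean(axis=1)
    threshold = np.median(group_sal)  # illustrative split of the bit budget
    W_q = W.copy()
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        bits = high_bits if group_sal[g] >= threshold else low_bits
        for r in range(W.shape[0]):
            W_q[r, cols] = quantize_group(W[r, cols], bits)
    return W_q

# Toy usage: a random layer and random calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)
X = rng.normal(size=(64, 512)).astype(np.float32)
W_q = salience_guided_quantize(W, X)
print("mean abs quantization error:", np.abs(W - W_q).mean())
```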
- Anthology ID: 2025.emnlp-main.408
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 8078–8088
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.408/
- Cite (ACL): Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, and Qingan Li. 2025. MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization (Hu et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.408.pdf