Beyond Dynamic Quantization: An Efficient Static Hierarchical Mix-precision Framework for Near-Lossless LLM Compression

Yi Zhang, Kai Zhang, Zheyang Li, Wenming Tan, Ye Ren, Jilin Hu


Abstract
Large language models (LLMs) have achieved overwhelming success but require massive storage and computational resources to support generative inference. Post-training quantization (PTQ) is a promising approach to reducing the memory usage, latency, and energy consumption of LLM deployment. However, the presence of outliers drives most existing PTQ methods toward dynamic quantization, which is hardware-unfriendly and often incurs large quantization errors in static scenarios. To address these limitations, we introduce a Static Hierarchical Mix-precision Quantization method (SHMQ), which enables near-lossless and hardware-friendly compression of LLMs. Theoretically, SHMQ quantifies both inter-layer and intra-layer sensitivity through unified derivations involving the Hessian. Specifically, SHMQ employs a systematic precision allocation strategy that seamlessly integrates coarse-grained inter-layer and fine-grained intra-layer static mix-precision quantization. Furthermore, a permutation procedure, which reorders sensitive and insensitive channels that share similar distributions, is leveraged to mitigate static quantization error. SHMQ achieves 75.58% accuracy on zero-shot reasoning tasks with W4.8A8 Qwen2.5-7B-Instruct, narrowing the accuracy gap to merely 0.13% while yielding an average 2.86× practical speedup.
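The abstract describes three technical ingredients: a Hessian-based sensitivity measure, a hierarchical (inter-layer and intra-layer) precision allocation, and a channel permutation step. The paper's exact formulation is not reproduced here; the following is a minimal, hypothetical Python sketch assuming a simple second-order sensitivity proxy (Hessian trace times quantization MSE), a greedy bit allocation under an average-bit budget, and a range-based channel reordering. All function names and toy values are illustrative and are not the authors' implementation.

# Illustrative sketch only (not the authors' released implementation): a simple
# second-order sensitivity proxy, a greedy mixed-precision bit allocation under
# an average-bit budget, and a range-based channel reordering.
# All names and toy values below are hypothetical.
import numpy as np

def quant_error(w, bits):
    """MSE of symmetric uniform quantization at the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return 0.0
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return float(np.mean((w - w_q) ** 2))

def sensitivity(w, hessian_trace, bits):
    """Second-order proxy: loss increase ≈ 0.5 * tr(H) * E[(w - w_q)^2]."""
    return 0.5 * hessian_trace * quant_error(w, bits)

def allocate_bits(layers, avg_bits, choices=(4, 8)):
    """Greedy allocation: start every layer at the low precision, then upgrade
    the layers whose sensitivity drops the most, while the budget lasts."""
    low, high = min(choices), max(choices)
    bits = {name: low for name, _, _ in layers}
    total = sum(w.size for _, w, _ in layers)
    budget = (avg_bits - low) * total  # extra bits available beyond the low precision
    gains = sorted(((sensitivity(w, h, low) - sensitivity(w, h, high), name, w.size)
                    for name, w, h in layers), reverse=True)
    for gain, name, size in gains:
        cost = (high - low) * size
        if gain > 0 and cost <= budget:
            bits[name] = high
            budget -= cost
    return bits

def permute_by_range(acts):
    """Reorder channels so those with similar activation ranges are adjacent,
    letting static per-group scales fit each group more tightly."""
    return np.argsort(np.max(np.abs(acts), axis=0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # (name, weight matrix, estimated Hessian trace) per layer -- toy values only.
    layers = [(f"layer{i}", rng.normal(size=(256, 256)), float(rng.uniform(0.1, 5.0)))
              for i in range(8)]
    print(allocate_bits(layers, avg_bits=4.8))
    print(permute_by_range(rng.normal(size=(64, 16)))[:8])

Note that this toy version flattens the paper's hierarchy into a single greedy pass and reorders channels only by range; the method described in the abstract allocates precision at both the inter-layer and intra-layer levels and pairs sensitive channels with similarly distributed insensitive ones.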
Anthology ID:
2025.emnlp-industry.175
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2025
Address:
Suzhou (China)
Editors:
Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2573–2587
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.175/
Cite (ACL):
Yi Zhang, Kai Zhang, Zheyang Li, Wenming Tan, Ye Ren, and Jilin Hu. 2025. Beyond Dynamic Quantization: An Efficient Static Hierarchical Mix-precision Framework for Near-Lossless LLM Compression. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2573–2587, Suzhou (China). Association for Computational Linguistics.
Cite (Informal):
Beyond Dynamic Quantization: An Efficient Static Hierarchical Mix-precision Framework for Near-Lossless LLM Compression (Zhang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.175.pdf