Automated Fine-Grained Mixture-of-Experts Quantization
Zhanhao Xie, Yuexiao Ma, Xiawu Zheng, Fei Chao, Wanchen Sui, Yong Li, Shen Li, Rongrong Ji
Abstract
The Mixture of Experts (MoE) architecture enables efficient model scaling through conditional computation, where only a subset of the parameters is activated per input. However, this distributed architecture poses unprecedented challenges for model compression, as conventional quantization methods optimized for dense networks prove inadequate. This paper introduces a specialized quantization framework for MoE architectures, motivated by our discovery that weight matrices across expert networks exhibit distinctive channel-wise outlier distributions, necessitating a more nuanced compression approach. Through theoretical analysis incorporating Fisher Information matrices and condition number characteristics, we establish a fundamental relationship between layer functionality and quantization sensitivity, demonstrating that down-projection layers inherently demand higher precision than up-projection layers. Leveraging these insights, we develop an automated channel-wise quantization framework that dynamically determines optimal bit-width allocations while maintaining minimal computational overhead through efficient statistical approximations. When evaluated on the Mixtral-8x7b-v0.1 architecture, our methodology demonstrates a 3.96% improvement over existing state-of-the-art approaches across natural language understanding benchmarks, while achieving superior compression ratios.
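The abstract describes sensitivity-driven, channel-wise bit-width allocation guided by Fisher Information and condition-number characteristics, but this page carries no implementation details. The sketch below is a minimal, hypothetical NumPy illustration of that general idea: a per-channel sensitivity proxy, a greedy bit allocation under an average-bit budget, and uniform per-channel quantization. The function names (`channel_sensitivity`, `allocate_bits`, `quantize_channelwise`), the sensitivity proxy, and the bit-width choices are assumptions made for illustration, not the authors' method or released code.

```python
import numpy as np

def channel_sensitivity(weight: np.ndarray, grad: np.ndarray) -> np.ndarray:
    """Per-output-channel sensitivity proxy (illustrative, not the paper's criterion).

    Combines a diagonal Fisher approximation (mean squared gradient) with the
    channel's dynamic range, so channels with large outliers or high curvature
    receive more bits.
    """
    fisher_diag = np.mean(grad ** 2, axis=1)                # [out_channels]
    dynamic_range = weight.max(axis=1) - weight.min(axis=1)  # outlier spread
    return fisher_diag * (1.0 + dynamic_range)

def allocate_bits(sensitivity: np.ndarray, avg_bits: float = 4.0,
                  choices=(2, 3, 4, 8)) -> np.ndarray:
    """Greedy per-channel bit allocation under an average-bit budget."""
    order = np.argsort(-sensitivity)                         # most sensitive first
    bits = np.full(sensitivity.shape, min(choices), dtype=int)
    budget = avg_bits * len(sensitivity) - bits.sum()
    for idx in order:
        for b in sorted(choices, reverse=True):              # try widest first
            extra = b - bits[idx]
            if 0 < extra <= budget:
                bits[idx] = b
                budget -= extra
                break
    return bits

def quantize_channelwise(weight: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Uniform symmetric per-channel quantization at the allocated widths."""
    out = np.empty_like(weight)
    for c, b in enumerate(bits):
        qmax = 2 ** (int(b) - 1) - 1
        scale = np.abs(weight[c]).max() / max(qmax, 1) + 1e-12
        out[c] = np.clip(np.round(weight[c] / scale), -qmax - 1, qmax) * scale
    return out

# Toy usage: one expert's projection weight and a surrogate gradient.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))
G = rng.normal(size=(64, 256))
bits = allocate_bits(channel_sensitivity(W, G), avg_bits=4.0)
W_q = quantize_channelwise(W, bits)
print("mean bits:", bits.mean(), "MSE:", np.mean((W - W_q) ** 2))
```

In this toy setup, more sensitive channels (larger squared gradients or wider dynamic range) are upgraded to higher bit-widths first, while the overall average stays within the budget; the actual allocation criterion and statistical approximations used in the paper may differ.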
- Anthology ID:
- 2025.findings-acl.1386
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venues:
- Findings | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 27024–27037
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1386/
- Cite (ACL):
- Zhanhao Xie, Yuexiao Ma, Xiawu Zheng, Fei Chao, Wanchen Sui, Yong Li, Shen Li, and Rongrong Ji. 2025. Automated Fine-Grained Mixture-of-Experts Quantization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 27024–27037, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Automated Fine-Grained Mixture-of-Experts Quantization (Xie et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1386.pdf