Fei Chao

2026

Large Language Models continue to scale in size and capability, driving substantial computational and memory demands. Mixture-of-Experts (MoE) architectures alleviate this cost by activating only a sparse subset of experts per token, enabling efficient scaling without proportional increases in inference compute. However, quantization in MoE models remains challenging due to heterogeneous sensitivity across experts and their internal linear layers. Existing mixed-precision frameworks such as Mixed-precision Quantization for MoE (MxMoE) require full quantization-loss evaluation across all expert, layer, and bit-width configurations, incurring prohibitive profiling cost. To address this, we propose **FRI-MxMoE**, a **profiling-free** mixed-precision quantization framework built on Fuzzy Rule Interpolation, designed as a drop-in replacement for the loss estimation component in MxMoE. By constructing a fuzzy rule base in the intra-expert layer feature space (bit-width, activation variance, parameter scale), our method predicts quantization error from only sparse samples, eliminating the need for dense profiling. Extensive experiments demonstrate that FRI-MxMoE accelerates the profiling phase by up to 15.7× (on DeepSeek-V2) while achieving comparable or slightly superior zero-shot accuracy (e.g., +1.04% on DeepSeek-V2-Lite) compared to the baseline. The approach enables continuous sensitivity modeling, preserves accuracy under mixed-precision allocation, and reduces offline computation by orders of magnitude.
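The idea of predicting quantization error from sparse samples can be illustrated with a deliberately simplified sketch. It is not the FRI-MxMoE method: plain linear interpolation of the log-error over bit-width stands in for fuzzy rule interpolation, the only feature used is bit-width, and all names (`quantize`, `interpolate_log_error`, the synthetic Gaussian weights) are assumptions made for illustration. Interpolating in log space is a choice motivated by the fact that the error of a uniform quantizer tends to shrink roughly geometrically as bits are added.

```python
import math
import random

def quantize(values, bits):
    """Symmetric uniform (abs-max) quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def interpolate_log_error(measured, bits):
    """Estimate the error at `bits` from the two nearest measured bit-widths,
    interpolating linearly in log-error (a stand-in for rule interpolation)."""
    lo = max(b for b in measured if b <= bits)
    hi = min(b for b in measured if b >= bits)
    if lo == hi:
        return measured[lo]
    t = (bits - lo) / (hi - lo)
    log_err = (1 - t) * math.log(measured[lo]) + t * math.log(measured[hi])
    return math.exp(log_err)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # hypothetical layer

# Sparse profiling: measure the quantization error at 2 and 8 bits only.
measured = {b: mse(weights, quantize(weights, b)) for b in (2, 8)}

# Predict every bit-width in between instead of profiling each one.
predicted = {b: interpolate_log_error(measured, b) for b in range(2, 9)}
```

In this toy setting only two of the seven bit-widths are actually evaluated; a realistic estimator would also condition on the other features named in the abstract (activation variance, parameter scale).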

2025

The Mixture of Experts (MoE) architecture enables efficient model scaling through conditional computation, where only a subset of parameters is activated per input. However, this distributed architecture poses unprecedented challenges for model compression, as conventional quantization methods optimized for dense networks prove inadequate. This paper introduces a specialized quantization framework for MoE architectures, motivated by our discovery that weight matrices across expert networks exhibit distinctive channel-wise outlier distributions, necessitating a more nuanced compression approach. Through theoretical analysis incorporating Fisher Information matrices and condition number characteristics, we establish a fundamental relationship between layer functionality and quantization sensitivity, demonstrating that down-projection layers inherently demand higher precision than up-projection layers. Leveraging these insights, we develop an automated channel-wise quantization framework that dynamically determines optimal bit-width allocations while maintaining minimal computational overhead through efficient statistical approximations. When evaluated on the Mixtral-8x7b-v0.1 architecture, our methodology demonstrates a 3.96% improvement over existing state-of-the-art approaches across natural language understanding benchmarks, while achieving superior compression ratios.
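Why channel-wise outliers matter for quantization can be seen in a small sketch. It is not the paper's framework: it only contrasts one shared abs-max scale for a whole weight matrix with one scale per channel, on synthetic data in which a single channel is made 50× larger. The matrix shape, the outlier factor, and the helper names are assumptions chosen for illustration.

```python
import random

def absmax_scale(values, bits):
    """Step size that maps the largest magnitude to the top integer level."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)

def quantize(values, bits, scale):
    """Symmetric uniform quantization with a given step size."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Hypothetical weight matrix: 8 channels of 256 weights each.
matrix = [[random.gauss(0.0, 0.02) for _ in range(256)] for _ in range(8)]
matrix[3] = [w * 50 for w in matrix[3]]  # channel 3 becomes an outlier channel

bits = 4
# Per-tensor: a single scale, dominated by the outlier channel.
tensor_scale = absmax_scale([w for row in matrix for w in row], bits)
per_tensor_err = [mse(row, quantize(row, bits, tensor_scale)) for row in matrix]

# Per-channel: each channel gets a scale fitted to its own range.
per_channel_err = [
    mse(row, quantize(row, bits, absmax_scale(row, bits))) for row in matrix
]
```

With a shared scale, the ordinary channels are quantized with a step sized for the outlier channel, so their error is far larger than with per-channel scales; the outlier channel itself sees the same error either way.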

Venues

ACL (1)
Findings (1)