Xiang Chang

2026

Large Language Models continue to scale in size and capability, driving substantial computational and memory demands. Mixture-of-Experts (MoE) architectures alleviate this cost by activating only a sparse subset of experts per token, enabling efficient scaling without proportional increases in inference compute. However, quantization in MoE models remains challenging due to heterogeneous sensitivity across experts and their internal linear layers. Existing mixed-precision frameworks such as Mixed-precision Quantization for MoE (MxMoE) require a full quantization-loss evaluation for every combination of expert, layer, and bit-width, incurring prohibitive profiling cost. To address this, we propose **FRI-MxMoE**, a **profiling-free** mixed-precision quantization framework built on Fuzzy Rule Interpolation, designed as a drop-in replacement for the loss estimation component in MxMoE. By constructing a fuzzy rule base in the intra-expert layer feature space (bit-width, activation variance, parameter scale), our method predicts quantization error from only sparse samples, eliminating the need for dense profiling. Extensive experiments demonstrate that FRI-MxMoE accelerates the profiling phase by up to 15.7× (on DeepSeek-V2) while achieving comparable or slightly superior zero-shot accuracy (e.g., +1.04% on DeepSeek-V2-Lite) compared to the baseline. The approach enables continuous sensitivity modeling, preserves accuracy under mixed-precision allocation, and reduces offline computation by orders of magnitude.
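
The idea of predicting quantization error from a sparse rule base can be illustrated with a minimal sketch. It is not the FRI-MxMoE implementation: it substitutes plain inverse-distance-weighted interpolation for a full fuzzy rule interpolation scheme, and the rule values, the `RULES` table, and the `predict_loss` function are all hypothetical. In practice the three features would also need to be normalized to comparable ranges before distances are computed.

```python
import math

# Hypothetical sparse rule base. Each rule maps an antecedent
# (bit_width, activation_variance, parameter_scale) to a measured
# quantization loss. All numbers are invented for illustration.
RULES = [
    ((2.0, 0.5, 1.0), 0.80),
    ((4.0, 0.5, 1.0), 0.20),
    ((8.0, 0.5, 1.0), 0.02),
    ((4.0, 2.0, 1.0), 0.35),
]


def predict_loss(query, rules=RULES, power=2.0):
    """Estimate the loss at `query` from sparse rules.

    Uses inverse-distance weighting, a simplified stand-in for
    fuzzy rule interpolation: closer rules contribute more.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for antecedent, loss in rules:
        dist = math.dist(query, antecedent)
        if dist == 0.0:
            # The query coincides with a profiled rule: return its loss.
            return loss
        weight = 1.0 / dist ** power
        weighted_sum += weight * loss
        weight_total += weight
    return weighted_sum / weight_total


# A 3-bit configuration that was never profiled directly.
estimate = predict_loss((3.0, 0.5, 1.0))
print(f"estimated loss at 3 bits: {estimate:.3f}")
```

A query that matches a profiled configuration returns the measured loss unchanged, while an unprofiled one receives an estimate between the losses of its neighbours, which is the property that lets a bit-width allocator avoid dense profiling.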

Co-authors

Ruiyu Zhuo

Venues

ACL
