Profiling-Free Mixed-Precision Quantization for MoE LLMs via Fuzzy Rule Interpolation

Huachen Qi; Ruiyu Zhuo; Bowen Shi; Xiang Chang; Fei Chao; Changjing Shang; Qiang Shen

Abstract

Large Language Models continue to scale in size and capability, driving substantial computational and memory demands. Mixture-of-Experts (MoE) architectures alleviate this cost by activating only a sparse subset of experts per token, enabling efficient scaling without proportional increases in inference compute. However, quantizing MoE models remains challenging because sensitivity varies across experts and across the linear layers within each expert. Existing mixed-precision frameworks such as Mixed-precision Quantization for MoE (MxMoE) require a full quantization-loss evaluation for every combination of expert, layer, and bit-width, incurring prohibitive profiling cost. To address this, we propose **FRI-MxMoE**, a **profiling-free** mixed-precision quantization framework built on Fuzzy Rule Interpolation and designed as a drop-in replacement for the loss estimation component in MxMoE. By constructing a fuzzy rule base in the intra-expert layer feature space (bit-width, activation variance, parameter scale), our method predicts quantization error from only sparse samples, eliminating the need for dense profiling. Extensive experiments demonstrate that FRI-MxMoE accelerates the profiling phase by up to 15.7× (on DeepSeek-V2) while achieving comparable or slightly superior zero-shot accuracy (e.g., +1.04% on DeepSeek-V2-Lite) compared to the baseline. The method thereby enables continuous sensitivity modeling, preserves accuracy under mixed-precision allocation, and reduces offline computation by orders of magnitude.
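
The idea of predicting quantization error from a few profiled points, rather than from a dense sweep, can be illustrated with a small sketch. This is not the FRI-MxMoE method: it replaces fuzzy rule interpolation with plain inverse-squared-distance weighting over crisp points, purely as a simplified stand-in. The names `Rule`, `interpolate_error`, and `scales`, and all numeric values, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    # Antecedent: one profiled point in the feature space
    # (bit-width, activation variance, parameter scale).
    features: Tuple[float, float, float]
    # Consequent: the quantization error observed at that point.
    error: float


def interpolate_error(
    rules: Sequence[Rule],
    query: Tuple[float, float, float],
    scales: Tuple[float, float, float],
    eps: float = 1e-12,
) -> float:
    """Estimate the error at `query` from sparse rules.

    Simplified stand-in for fuzzy rule interpolation: each rule's
    consequent is weighted by the inverse squared distance between
    the query and the rule's antecedent. `scales` normalises each
    feature so that no single axis dominates the distance.
    """
    weights = []
    for rule in rules:
        d2 = sum(
            ((q - f) / s) ** 2
            for q, f, s in zip(query, rule.features, scales)
        )
        if d2 < eps:
            # The query coincides with a profiled point: return it directly.
            return rule.error
        weights.append(1.0 / d2)
    total = sum(weights)
    return sum(w * r.error for w, r in zip(weights, rules)) / total


# Toy rule base (made-up values): error falls as bit-width grows.
toy_rules = [
    Rule(features=(2.0, 1.0, 1.0), error=0.8),
    Rule(features=(8.0, 1.0, 1.0), error=0.1),
]
toy_scales = (1.0, 1.0, 1.0)

# Estimate for an unprofiled 5-bit configuration with the same statistics.
estimate = interpolate_error(toy_rules, (5.0, 1.0, 1.0), toy_scales)
```

In this toy setting the 5-bit query is equidistant from both rules, so the estimate is simply the mean of the two consequents. A real fuzzy rule interpolation scheme would operate on fuzzy sets rather than crisp points, which this sketch does not attempt.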

Anthology ID: 2026.acl-long.982
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 21484–21499
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.982/
Cite (ACL): Huachen Qi, Ruiyu Zhuo, Bowen Shi, Xiang Chang, Fei Chao, Changjing Shang, and Qiang Shen. 2026. Profiling-Free Mixed-Precision Quantization for MoE LLMs via Fuzzy Rule Interpolation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21484–21499, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Profiling-Free Mixed-Precision Quantization for MoE LLMs via Fuzzy Rule Interpolation (Qi et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.982.pdf
Checklist: 2026.acl-long.982.checklist.pdf
