Faster MoE LLM Inference for Extremely Large Models

Haoqi Yang; Luohe Shi; Qiwei Li; Zuchao Li; Ping Wang; Hao Huang; Hai Zhao

Abstract

In fine-grained sparse Mixture-of-Experts (MoE) models, a large pool of specialized experts replaces a small homogeneous set, so both task performance and throughput are governed by expert activation at inference time. Yet most existing optimization recipes implicitly assume a fixed activation budget (e.g., a constant Top-k per layer), and the behavior of such budgets in fine-grained MoEs is poorly understood. We first characterize runtime skipping strategies, quantifying the accuracy–efficiency trade-off of (i) uniform fixed activation and (ii) static layer-wise Top-k allocation found by search. Our analysis reveals that static skipping can already provide substantial throughput gains, but that optimal static schedules vary significantly across models and routing mechanisms. We therefore introduce Adaptive Skipping with Entropy-Penalized Thresholding (ASET), a training-free policy that adapts token-level activation using router confidence and entropy while remaining within the model’s original budget. Across the fine-grained MoEs we study, static skipping policies yield 10–78% throughput gains with minimal performance degradation, including a ≥10% improvement on DeepSeek-V3 without measurable loss. On the OLMoE testbed, ASET yields a Pareto frontier between average activation and task quality. Overall, these results identify expert skipping as a practical lever for faster fine-grained MoE inference, with adaptive activation helping when fixed budgets are too rigid.
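
To make the contrast between a fixed activation budget and token-level adaptive activation concrete, the following is a minimal illustrative sketch in plain Python. It is not the paper's ASET formula, which the abstract does not specify. The function names, the cumulative-probability stopping rule, and the `base_threshold` and `entropy_weight` parameters are all assumptions chosen for illustration. The sketch only shares the two properties stated in the abstract: the decision depends on router confidence and entropy, and the number of active experts never exceeds the original Top-k budget (`k_max`).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the value lies in [0, 1]."""
    n = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n) if n > 1 else 0.0

def static_top_k(probs, k):
    """Fixed budget: indices of the k highest-probability experts."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def adaptive_select(probs, k_max, base_threshold=0.5, entropy_weight=0.5):
    """Hypothetical adaptive rule (an assumption, not the paper's method).

    Start from the usual Top-k_max candidates, renormalize their router
    probabilities, and keep experts in descending order until their
    cumulative mass reaches a target. The target rises with the router's
    normalized entropy, so a confident router activates fewer experts and
    an uncertain router falls back toward the full budget.
    """
    top = static_top_k(probs, k_max)
    top_mass = sum(probs[i] for i in top)
    target = min(1.0, base_threshold + entropy_weight * normalized_entropy(probs))
    chosen, cumulative = [], 0.0
    for i in top:
        chosen.append(i)  # at least one expert is always kept
        cumulative += probs[i] / top_mass
        if cumulative >= target:
            break
    return chosen  # never longer than k_max

if __name__ == "__main__":
    confident = softmax([10.0] + [0.0] * 7)  # one expert dominates
    uncertain = softmax([0.0] * 8)           # router is maximally unsure
    print(len(adaptive_select(confident, k_max=4)))
    print(len(adaptive_select(uncertain, k_max=4)))
```

In this sketch a uniform static policy corresponds to calling `static_top_k` with a smaller `k` for every token, while the adaptive rule varies the count per token between 1 and `k_max`.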

Anthology ID: 2026.findings-acl.2140
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 43133–43151
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2140/
Cite (ACL): Haoqi Yang, Luohe Shi, Qiwei Li, Zuchao Li, Ping Wang, Hao Huang, and Hai Zhao. 2026. Faster MoE LLM Inference for Extremely Large Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 43133–43151, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Faster MoE LLM Inference for Extremely Large Models (Yang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2140.pdf
Checklist: 2026.findings-acl.2140.checklist.pdf
