From Pseudo-Balancing to True Specialization: Memory-Aware Routing for Mixture-of-Experts

Peixuan Hou; Yunbo Hou; Bin Chen; Li He (何丽); Jian Xu; Weiping Li; Bo Zheng; Guojie Song

Abstract

Mixture-of-Experts (MoE) models train large networks efficiently: sparse activation lowers cost by selecting only a few experts for each input based on the data's characteristics. In MoE, an unbalanced expert load leads to inefficient expert utilization and routing collapse. Existing methods commonly adopt an expert-centered balancing strategy to address this, prioritizing equal utilization of experts over semantic alignment between tokens and experts. However, this can lead to a pseudo-balance phenomenon: to keep the expert load balanced, the same input is randomly routed to different experts across training steps instead of to the best-matching one. This introduces two critical issues: (1) severe knowledge overlap among experts, resulting in redundant representations and inefficient parameter utilization; and (2) difficulty in forming and stabilizing expert specialization. These issues limit the scalability of models, especially large language models (LLMs). To address these limitations, we introduce Memory-Aware Routing (MAR), a training-phase approach that enhances existing load-balancing strategies. By equipping each expert with a memory buffer, our method explicitly models each expert's long-term preferences, allowing historical experience to guide routing. This ensures that tokens are routed more consistently to compatible experts, mitigating the pseudo-balance problem while maintaining global load balance and fostering expert specialization. Experimental results show that MAR improves expert specialization by 35% and downstream accuracy by 2%-25%, doubles parameter efficiency, and matches baseline performance with only half the experts.
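
The abstract does not specify how the memory buffer is built or how it enters the routing decision. As a rough illustration of the general idea only, the sketch below keeps one running summary vector per expert and adds a similarity bonus to the gate logits, so a token that resembles what an expert has seen before is more likely to be sent there again. Everything in it is an assumption made for illustration and is not taken from the paper: the class name `MemoryBiasedRouter`, the exponential moving average update, the cosine-similarity bias, and the parameters `decay` and `bias_weight`. It also omits the load-balancing term that MAR is described as working alongside.

```python
import numpy as np


class MemoryBiasedRouter:
    """Toy top-k router whose gate logits are nudged by per-expert memory.

    Hypothetical illustration: the EMA memory, the cosine bias, and the
    names `decay` and `bias_weight` are assumptions, not the paper's design.
    """

    def __init__(self, dim, num_experts, top_k=2, decay=0.9, bias_weight=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for the learned gate weights of a real MoE layer.
        self.gate = rng.normal(scale=0.02, size=(dim, num_experts))
        # One running summary vector per expert (the assumed "memory buffer").
        self.memory = np.zeros((num_experts, dim))
        self.top_k = top_k
        self.decay = decay
        self.bias_weight = bias_weight

    def _memory_bias(self, tokens):
        # Cosine similarity between every token and every expert's memory.
        t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
        m = self.memory / (np.linalg.norm(self.memory, axis=1, keepdims=True) + 1e-8)
        return t @ m.T  # shape: (num_tokens, num_experts)

    def route(self, tokens, update_memory=True):
        # Gate logits plus a bonus for experts whose memory matches the token.
        logits = tokens @ self.gate + self.bias_weight * self._memory_bias(tokens)
        # Indices of the top_k highest-scoring experts for each token.
        chosen = np.argsort(-logits, axis=1)[:, : self.top_k]
        if update_memory:
            for expert in range(self.memory.shape[0]):
                mask = (chosen == expert).any(axis=1)
                if mask.any():
                    # Blend the mean of the tokens just routed here into memory.
                    mean_token = tokens[mask].mean(axis=0)
                    self.memory[expert] = (
                        self.decay * self.memory[expert]
                        + (1.0 - self.decay) * mean_token
                    )
        return chosen
```

Calling `route` on a `(num_tokens, dim)` array returns a `(num_tokens, top_k)` array of expert indices and, during training, updates the memory of every expert that received at least one token. A real system would combine this bias with a balancing objective so that memory-driven consistency does not collapse the load onto a few experts.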

Anthology ID: 2026.findings-acl.857
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 17320–17337
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.857/
Cite (ACL): Peixuan Hou, Yunbo Hou, Bin Chen, LI He, Jian Xu, Weiping Li, Bo Zheng, and Guojie Song. 2026. From Pseudo-Balancing to True Specialization: Memory-Aware Routing for Mixture-of-Experts. In Findings of the Association for Computational Linguistics: ACL 2026, pages 17320–17337, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): From Pseudo-Balancing to True Specialization: Memory-Aware Routing for Mixture-of-Experts (Hou et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.857.pdf
Checklist: 2026.findings-acl.857.checklist.pdf
