Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
Abstract
This paper revisits the implementation of the load-balancing loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, the LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ is the fraction of tokens routed to expert $i$, and $p_i$ is the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy, so $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. However, a micro-batch for training billion-scale LLMs typically contains very few sequences, so the micro-batch LBL operates almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are routed uniformly to all experts, which inhibits expert specialization. In this work, we propose calculating the LBL over a global batch to loosen this constraint. Because a global batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level rather than at the sequence level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use the synchronized frequencies to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL greatly improves the domain specialization of experts. Global-batch LBL is also used in the Qwen3 MoE models.
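The formula and the extra communication step translate directly into code. Below is a minimal PyTorch sketch of the global-batch LBL, assuming top-k routing and an already-initialized `torch.distributed` process group; the function name `global_batch_lbl`, the tensor shapes, and the use of the default process group are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist


def global_batch_lbl(gate_probs: torch.Tensor,
                     expert_indices: torch.Tensor,
                     num_experts: int) -> torch.Tensor:
    """Load-balancing loss with expert frequencies synchronized globally.

    gate_probs:     (num_tokens, num_experts) softmax router outputs.
    expert_indices: (num_tokens, top_k) indices of the selected experts.
    """
    # p_i: average gating score of expert i over the local micro-batch.
    p = gate_probs.mean(dim=0)  # (num_experts,)

    # f_i: fraction of routing slots assigned to expert i, counted locally.
    counts = F.one_hot(expert_indices, num_experts).sum(dim=(0, 1)).float()
    f = counts / expert_indices.numel()  # (num_experts,), sums to 1

    # Extra communication step: average f_i over all parallel ranks so it
    # reflects the global batch rather than the few-sequence micro-batch.
    # f carries no gradient (it comes from hard routing decisions), so the
    # LBL gradient flows through p as usual.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(f, op=dist.ReduceOp.SUM)
        f = f / dist.get_world_size()

    # LBL = N_E * sum_i f_i * p_i, as defined in the abstract.
    return num_experts * torch.sum(f * p)
```

In a Megatron-style framework this `all_reduce` would typically run over the data-parallel (and any context/pipeline-parallel) groups rather than the default group; the sketch uses the default group only to stay self-contained.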
- Anthology ID: 2025.acl-long.249
- Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 5005–5018
- URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.249/
- Cite (ACL): Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5005–5018, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models (Qiu et al., ACL 2025)
- PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.249.pdf