Fanxin Li
2025
FoldMoE: Efficient Long Sequence MoE Training via Attention-MoE Pipelining
Guichao Zhu | Lintian Lei | Yuhao Qing | Yichao Fu | Fanxin Li | Dong Huang | Zekai Sun | Heming Cui
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training LLMs with the Mixture-of-Experts (MoE) architecture on long sequences poses significant challenges due to the all-to-all communication bottleneck of expert parallelism. While existing approaches attempt to hide the communication cost behind computation through token-level pipelining within MoE layers, their effectiveness is limited by the insufficient computation available inside the MoE layer alone. We present FoldMoE, a high-performance MoE training system that enables token-level overlapping across entire Transformer blocks through novel attention-MoE pipelining. We propose an efficient pipeline schedule, a novel token buffering design that decouples attention and MoE layer partitioning, and a time-uniform micro-batching strategy for enhanced efficiency. Evaluations on GPT-MoE models with sequences up to 32K tokens show that FoldMoE achieves up to 1.49x speedup over state-of-the-art token-level overlapping baselines and up to 2.72x speedup over non-overlapping baselines.
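To make the overlapping idea concrete, below is a minimal, runnable sketch of token-level pipelining across a Transformer block. It is not FoldMoE's implementation: the functions, chunk granularity, and two-stage schedule are all hypothetical stand-ins, and the "token buffer" is reduced to a single in-flight slot. The sketch only illustrates the general pattern the abstract describes, i.e. letting the all-to-all dispatch of one token chunk proceed while the attention computation of the next chunk runs.

```python
# Conceptual sketch only (assumptions, not FoldMoE's code): attention(),
# all_to_all_dispatch(), and expert_ffn() are hypothetical placeholders that
# simulate compute and communication with sleeps.
from concurrent.futures import ThreadPoolExecutor
import time


def attention(chunk):
    # Stand-in for the attention computation on one token chunk.
    time.sleep(0.01)
    return f"attn({chunk})"


def all_to_all_dispatch(x):
    # Stand-in for the expert-parallel all-to-all communication.
    time.sleep(0.01)
    return f"dispatched({x})"


def expert_ffn(x):
    # Stand-in for the MoE expert computation after dispatch.
    time.sleep(0.01)
    return f"expert({x})"


def transformer_block_pipelined(token_chunks):
    """Overlap the all-to-all of chunk i with the attention of chunk i+1.

    The `in_flight` slot plays the role of a (trivial) token buffer that
    decouples how chunks leave attention from when they enter the MoE layer.
    """
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as comm_stream:  # models an async comm stream
        in_flight = None  # future for the chunk currently being dispatched
        for chunk in token_chunks:
            a = attention(chunk)              # compute current chunk's attention
            if in_flight is not None:         # meanwhile, the previous chunk's
                outputs.append(               # all-to-all has been progressing;
                    expert_ffn(in_flight.result()))  # finish it and run its experts
            in_flight = comm_stream.submit(all_to_all_dispatch, a)
        if in_flight is not None:             # drain the last in-flight chunk
            outputs.append(expert_ffn(in_flight.result()))
    return outputs


if __name__ == "__main__":
    print(transformer_block_pipelined(["c0", "c1", "c2", "c3"]))
```

In this toy schedule, communication for chunk i is hidden behind attention for chunk i+1; FoldMoE additionally exploits computation across the whole Transformer block and balances micro-batch durations, which this sketch does not model.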