@inproceedings{yuan-etal-2025-legomt2,
title = "{L}ego{MT}2: Selective Asynchronous Sharded Data Parallel Training for Massive Neural Machine Translation",
author = "Yuan, Fei and
Lu, Yinquan and
Li, Lei and
Xu, Jingjing",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1200/",
pages = "23359--23376",
ISBN = "979-8-89176-256-5",
abstract = "It is a critical challenge to learn a single model for massive languages. Prior methods focus on increasing the model size and training data size. However, large models are difficult to optimize efficiently even with distributed parallel training and translation capacity can interfere among languages. To address the challenge, we propose LegoMT2, an efficient training approach with an asymmetric multi-way model architecture for massive multilingual neural machine translation. LegoMT2 shards 435 languages into 8 language-centric groups and attributes one local encoder for each group{'}s languages and a mix encoder-decoder for all languages. LegoMT2 trains the model through local data parallel and asynchronous distributed updating of parameters. LegoMT2 is 16.2$\times$ faster than the distributed training method for M2M-100-12B (which only for 100 languages) while improving the translation performance by an average of 2.2 BLEU on \textit{Flores-101}, especially performing better for low-resource languages ."
}
Markdown (Informal)
[LegoMT2: Selective Asynchronous Sharded Data Parallel Training for Massive Neural Machine Translation](https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1200/) (Yuan et al., Findings 2025)