LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-Training
Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng
Abstract
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch at a large scale still suffers from data hunger and instability problems. Motivated by this limitation, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of the original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and the additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models maintain language abilities and route input tokens to specific experts with only part of the parameters activated. Empirically, by training on 200B tokens, the LLaMA-MoE-3.5B models significantly outperform dense models with a similar number of activated parameters.
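Below is a minimal, illustrative sketch of the expert-construction idea described in the abstract: the intermediate neurons of a LLaMA-style SwiGLU FFN are partitioned into non-overlapping experts that reuse the dense weights, and a newly initialized gate network routes each token to its top-k experts. The module names (`DenseSwiGLU`, `MoEFromDense`), the random neuron split, and top-2 routing are assumptions for illustration, not the paper's exact construction or training recipe.

```python
# Sketch only: partition a dense SwiGLU FFN into experts and add a router.
# The split strategy and routing here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSwiGLU(nn.Module):
    """LLaMA-style dense FFN: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden: int, inter: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, inter, bias=False)
        self.up_proj = nn.Linear(hidden, inter, bias=False)
        self.down_proj = nn.Linear(inter, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MoEFromDense(nn.Module):
    """Split the intermediate dimension of a dense FFN into `n_experts`
    equal, non-overlapping slices and route tokens with a learned gate."""
    def __init__(self, dense: DenseSwiGLU, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        hidden, inter = dense.gate_proj.in_features, dense.gate_proj.out_features
        assert inter % n_experts == 0
        slice_size = inter // n_experts
        perm = torch.randperm(inter)  # random neuron partition (one possible strategy)

        self.experts = nn.ModuleList()
        for i in range(n_experts):
            idx = perm[i * slice_size:(i + 1) * slice_size]
            expert = DenseSwiGLU(hidden, slice_size)
            # Copy the corresponding rows/columns of the dense FFN weights.
            expert.gate_proj.weight.data.copy_(dense.gate_proj.weight.data[idx])
            expert.up_proj.weight.data.copy_(dense.up_proj.weight.data[idx])
            expert.down_proj.weight.data.copy_(dense.down_proj.weight.data[:, idx])
            self.experts.append(expert)

        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)  # new gate network

    def forward(self, x):  # x: (tokens, hidden)
        scores = F.softmax(self.router(x), dim=-1)
        weights, selected = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


if __name__ == "__main__":
    dense = DenseSwiGLU(hidden=64, inter=256)
    moe = MoEFromDense(dense, n_experts=8, top_k=2)
    tokens = torch.randn(10, 64)
    print(moe(tokens).shape)  # torch.Size([10, 64])
```

Continual pre-training, which the sketch omits, would then update both the expert weights and the newly added router on further data.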
- Anthology ID:
- 2024.emnlp-main.890
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15913–15923
- URL:
- https://preview.aclanthology.org/landing_page/2024.emnlp-main.890/
- DOI:
- 10.18653/v1/2024.emnlp-main.890
- Cite (ACL):
- Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. 2024. LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-Training. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15913–15923, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-Training (Zhu et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/landing_page/2024.emnlp-main.890.pdf