Dunjun Li
2026
ModularMoE: Fast LLM Customization with Parameter-Sharing Mixture-of-Experts for Low-Resource Settings
Jiaxing Liu | Qi Qi | Haifeng Sun | Dunjun Li | Zirui Zhuang | Bo He | Xiang Yang | Cong Liu | Jianxin Liao | Jingyu Wang
Findings of the Association for Computational Linguistics: ACL 2026
The massive size of Large Language Models (LLMs) imposes substantial computational and storage burdens, particularly on devices with limited hardware resources. Compared to foundation models, smaller and more specialized models are often more suitable for practical deployment. Existing customization approaches, such as the conventional “prune-then-finetune” paradigm or task-agnostic deployment strategies, either incur excessive computational costs or lead to suboptimal task performance. The recently popular Mixture-of-Experts (MoE) architecture exhibits a strong ability to mitigate inter-task interference, offering a new perspective on model deployment. In this paper, we introduce ModularMoE, a training framework that converts pre-trained LLMs into parameter-sharing MoE models for lightweight deployment. Exploiting the emergent modularity within LLMs, we split the feed-forward layers into multiple disjoint modules. Each expert is then constructed as a combination of such modules, enabling knowledge sharing across experts and thereby improving parameter efficiency within MoEs. Extensive experiments across multiple downstream tasks demonstrate that ModularMoE outperforms state-of-the-art baselines at the same sparsity level, achieving an average performance improvement of 4.10% to 28.75% while delivering up to 2.71× inference speedup.
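The parameter-sharing idea in the abstract (disjoint feed-forward modules, with each expert built as a combination of modules) can be illustrated with a minimal sketch. This is not the authors' implementation. The contiguous neuron split, the expert layout, the ReLU activation, and all names (`split_into_modules`, `ffn_forward`, `expert_neurons`) are assumptions made for illustration. The abstract does not specify how modules are identified or how inputs are routed, so the sketch omits both clustering and a router.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def split_into_modules(num_neurons, num_modules):
    """Partition hidden-neuron indices into disjoint modules.

    Assumption: a contiguous, equal-size split stands in for whatever
    modularity-aware grouping the paper uses. Requires that num_neurons
    is divisible by num_modules.
    """
    size = num_neurons // num_modules
    return [list(range(m * size, (m + 1) * size)) for m in range(num_modules)]

def expert_neurons(modules, module_ids):
    """An expert is the union of the neurons in its assigned modules."""
    return [j for m in module_ids for j in modules[m]]

def ffn_forward(x, w_in, w_out, neurons):
    """Feed-forward pass restricted to the hidden neurons in `neurons`.

    w_in[j] is the input weight vector of hidden neuron j,
    w_out[j] is its output weight vector.
    """
    y = [0.0] * len(w_out[0])
    for j in neurons:
        h = relu(sum(wi * xi for wi, xi in zip(w_in[j], x)))
        for k, wo in enumerate(w_out[j]):
            y[k] += h * wo
    return y

rng = random.Random(0)
d_model, d_hidden, num_modules = 4, 8, 4
w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
modules = split_into_modules(d_hidden, num_modules)

# Hypothetical expert layout: neighbouring experts share one module,
# so the shared module's weights are stored once and reused.
experts = {0: [0, 1], 1: [1, 2], 2: [2, 3]}

x = [rng.uniform(-1, 1) for _ in range(d_model)]
y_expert0 = ffn_forward(x, w_in, w_out, expert_neurons(modules, experts[0]))
y_dense = ffn_forward(x, w_in, w_out, range(d_hidden))
```

In this layout each expert evaluates only half of the hidden neurons, which is where the inference savings would come from. Because the modules are disjoint and a feed-forward layer sums over hidden neurons, the per-module outputs add up to the dense layer's output.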