UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity Mixture-of-Experts

Zhenyu Liu; Yunxin Li; Xuanyu Zhang; Qixun Teng; Shenyuan Jiang; Xinyu Chen; Haoyuan Shi; Haolan Chen; Fanbo Meng; Mingjun Zhao; Yu Xu; Yancheng He; Baotian Hu; Haizhou Li; Min Zhang

Abstract

Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts between semantic speech and structural music modeling, and severe data imbalances, which impede the development of a truly unified model. To address these challenges, we propose **UniMoE-Audio**, a unified speech and music generation model built upon a novel **D**ynamic-**C**apacity **M**ixture-**o**f-**E**xperts (DCMoE) framework. Architecturally, UniMoE-Audio extends the conventional MoE paradigm by introducing a Top-P routing strategy for adaptive capacity allocation. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each specialist without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing universal audio generation.
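The abstract names Top-P routing but does not define it. One common reading, by analogy with nucleus sampling, is that each token activates the smallest set of experts whose cumulative gate probability reaches a threshold p, so confident tokens use few experts and uncertain tokens use more. The sketch below illustrates that reading only; the function names, the threshold value, and the renormalization step are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_route(logits, p=0.7):
    """Hypothetical Top-P gate for a single token.

    Selects experts in descending gate probability until their
    cumulative probability reaches p, then renormalizes the selected
    weights so they sum to 1. The number of active experts therefore
    varies per token (assumed meaning of "dynamic capacity").
    """
    probs = softmax(logits)
    # Stable sort: tied experts keep their original index order.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in chosen)
    weights = [probs[i] / mass for i in chosen]
    return chosen, weights


# A token with one dominant gate logit activates a single expert,
# while a token with a flat gate distribution activates several.
confident_experts, _ = top_p_route([4.0, 0.0, 0.0, 0.0], p=0.7)
uncertain_experts, _ = top_p_route([1.0, 1.0, 1.0, 1.0], p=0.7)
print(len(confident_experts), len(uncertain_experts))
```

Under this reading, a fixed Top-K gate would spend the same compute on both tokens, whereas the cumulative threshold lets the expert count follow the gate's uncertainty.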

Anthology ID: 2026.acl-long.412
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 9107–9119
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.412/
Cite (ACL): Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, Shenyuan Jiang, Xinyu Chen, Haoyuan Shi, Haolan Chen, Fanbo Meng, Mingjun Zhao, Yu Xu, Yancheng He, Baotian Hu, Haizhou Li, and Min Zhang. 2026. UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity Mixture-of-Experts. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9107–9119, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity Mixture-of-Experts (Liu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.412.pdf
Checklist: 2026.acl-long.412.checklist.pdf
