MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free

Yishu Lei; Shuwei He; Hu Jing; Dan Zhang; Xianlong Luo; Danxiang Zhu; Shikun Feng; Rui Liu; Jingzhou He; Yu Sun; Hua Wu (吴华); Haifeng Wang

Abstract

Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing approaches rely on a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization because the parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines at comparable computational cost. To facilitate future research, our code is publicly available at https://github.com/Alittleegg/Eureka-Audio.
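
The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `MoEAdapterSketch`, the use of plain linear experts, a single shared expert, top-2 routing, and softmax renormalization over the selected experts are all assumptions made for illustration, since the abstract does not specify these details. The sketch only shows how a gate can send each audio token to a few routed experts while a shared expert processes every token.

```python
import math
import random


def matvec(w, x):
    """Multiply a (d_out x d_in) weight matrix by a d_in vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, d_out, d_in):
    """Random (d_out x d_in) matrix; the scale is an arbitrary choice."""
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


class MoEAdapterSketch:
    """Hypothetical adapter: top-k routed linear experts plus one shared expert."""

    def __init__(self, d_in, d_out, n_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.gate = random_matrix(rng, n_experts, d_in)  # router weights
        self.experts = [random_matrix(rng, d_out, d_in) for _ in range(n_experts)]
        self.shared = random_matrix(rng, d_out, d_in)    # always-active expert

    def route(self, token):
        """Return the chosen expert indices and their renormalized gate weights."""
        logits = matvec(self.gate, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        weights = softmax([logits[i] for i in chosen])
        return chosen, weights

    def forward(self, token):
        """Project one audio token; only the top_k routed experts are evaluated."""
        chosen, weights = self.route(token)
        out = matvec(self.shared, token)  # shared path sees every token
        for idx, w in zip(chosen, weights):
            expert_out = matvec(self.experts[idx], token)
            out = [o + w * e for o, e in zip(out, expert_out)]
        return out, chosen
```

Calling `MoEAdapterSketch(d_in=8, d_out=16).forward(token)` on an 8-dimensional token returns a 16-dimensional projection together with the indices of the experts that were used. Because each token touches only its selected experts, tokens routed to different experts update disjoint routed parameters during training, which is the intuition behind the reduced gradient conflict described above.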

Anthology ID: 2026.findings-acl.840
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 17039–17050
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.840/
Cite (ACL): Yishu Lei, Shuwei He, Hu Jing, Dan Zhang, Xianlong Luo, Danxiang Zhu, Shikun Feng, Rui Liu, Jingzhou HE, Yu Sun, Hua Wu, and Haifeng Wang. 2026. MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free. In Findings of the Association for Computational Linguistics: ACL 2026, pages 17039–17050, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free (Lei et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.840.pdf
Checklist: 2026.findings-acl.840.checklist.pdf
