Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, Zuozhu Liu


Abstract
Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making. However, they are designated for specific classification or generative tasks, and require model training or finetuning on large-scale datasets with sizeable parameters and tremendous computing, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework Med-MoE (Mixture-of-Experts) that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, Instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by meta experts. Comprehensive experiments on both open- and close-end medical question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30%-50% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
Anthology ID:
2024.findings-emnlp.221
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3843–3860
Language:
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.221/
DOI:
10.18653/v1/2024.findings-emnlp.221
Bibkey:
Cite (ACL):
Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, and Zuozhu Liu. 2024. Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3843–3860, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models (Jiang et al., Findings 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.221.pdf