Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Songtao Jiang; Tuo Zheng; Yan Zhang (张琰, 张廷); Yeying Jin; Li Yuan; Zuozhu Liu

doi:10.18653/v1/2024.findings-emnlp.221

Abstract

General-purpose and domain-specific multimodal large language models (LLMs) have made remarkable progress in medical decision-making. However, they are designed for specific classification or generative tasks, and require training or fine-tuning on large-scale datasets with sizeable parameter counts and substantial compute, which hinders their clinical utility in diverse resource-constrained settings. In this paper, we propose Med-MoE (Mixture-of-Experts), a novel and lightweight framework that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning medical images with LLM tokens, we enable the model for different multimodal medical tasks through instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by meta experts. Comprehensive experiments on both open- and closed-end medical visual question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model achieves performance superior to or on par with state-of-the-art baselines while requiring only approximately 30%-50% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
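
The routing idea described in the abstract (a router selects among domain-specific experts, with meta experts contributing alongside them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `make_linear_expert`, `moe_forward`), the linear router, the top-k gating rule, and the assumption that the meta expert is always active are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear_expert(dim, seed):
    """Hypothetical stand-in for an expert: a fixed random linear map."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def moe_forward(x, router_weights, domain_experts, meta_expert, top_k=1):
    """Route x to the top_k domain experts and add the meta expert's output.

    Assumption: the meta expert is always active; only the selected
    domain experts are evaluated, which is what keeps the number of
    activated parameters small.
    """
    # Router: one score per domain expert (a linear layer in this sketch).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(scores)

    # Keep the top_k experts by router probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    norm = sum(probs[i] for i in chosen)

    out = meta_expert(x)
    for i in chosen:
        gate = probs[i] / norm  # renormalise gates over the chosen experts
        y = domain_experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


if __name__ == "__main__":
    dim, n_experts = 4, 3
    rng = random.Random(0)
    router_weights = [[rng.uniform(-1, 1) for _ in range(dim)]
                      for _ in range(n_experts)]
    domain_experts = [make_linear_expert(dim, seed=i + 1)
                      for i in range(n_experts)]
    meta_expert = make_linear_expert(dim, seed=99)

    x = [0.5, -1.0, 0.25, 2.0]
    output, chosen = moe_forward(x, router_weights, domain_experts,
                                 meta_expert, top_k=1)
    print("selected domain expert(s):", chosen)
    print("output vector:", output)
```

With `top_k=1`, only one of the three domain experts is evaluated per input, so the cost per input is roughly that of one domain expert plus the meta expert, regardless of how many domain experts exist.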

Anthology ID: 2024.findings-emnlp.221
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3843–3860
URL: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.221/
DOI: 10.18653/v1/2024.findings-emnlp.221
Cite (ACL): Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, and Zuozhu Liu. 2024. Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3843–3860, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models (Jiang et al., Findings 2024)
PDF: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.221.pdf
