CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng


Abstract
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone of multimodal intelligence. However, recent studies have found that CLIP encodes only one aspect of the feature space, leading to substantial information loss and indistinct features. To mitigate this issue, this paper introduces a novel strategy that fine-tunes a series of complementary CLIP models and transforms them into a CLIP-MoE. Specifically, we propose a model-agnostic Diversified Multiplet Upcycling (DMU) framework for CLIP. Instead of training multiple CLIP models from scratch, DMU leverages a pre-trained CLIP and fine-tunes it into a diverse set of models via highly cost-effective multistage contrastive learning, thus capturing distinct feature subspaces efficiently. To fully exploit these fine-tuned models while minimizing computational overhead, we transform them into a CLIP-MoE, which dynamically activates a subset of CLIP experts, achieving an effective balance between model capacity and computational cost. Comprehensive experiments demonstrate the superior performance of CLIP-MoE across various zero-shot retrieval and zero-shot image classification tasks, as well as on downstream Multimodal Large Language Model (MLLM) benchmarks when used as a vision encoder. Code is available at https://github.com/OpenSparseLLMs/CLIP-MoE.
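The abstract describes upcycling fine-tuned CLIP copies into a sparse Mixture-of-Experts that activates only a subset of experts per input. The snippet below is not the authors' released implementation (see the linked repository for that); it is a minimal, hedged sketch of the general idea: a top-k routed MoE feed-forward layer whose experts could be FFN blocks taken from independently fine-tuned CLIP copies. The class name TopKMoEFFN, the dense per-expert loop, and the dummy dimensions are illustrative assumptions, not names from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoEFFN(nn.Module):
    """Sketch of a sparse MoE feed-forward layer.

    Each expert is assumed to be an FFN block copied from an independently
    fine-tuned CLIP model; a learned router activates the top-k experts
    per token. Names and shapes are hypothetical.
    """

    def __init__(self, expert_ffns, d_model, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)  # one FFN per fine-tuned copy
        self.router = nn.Linear(d_model, len(expert_ffns), bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (batch, seq, d_model)
        logits = self.router(x)                         # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # route each token to k experts
        weights = F.softmax(weights, dim=-1)            # normalize the selected scores
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            expert_out = expert(x)                      # dense for clarity, not speed
            for slot in range(self.top_k):
                mask = (idx[..., slot] == e).unsqueeze(-1).float()
                out = out + mask * weights[..., slot:slot + 1] * expert_out
        return out


# Toy usage with placeholder FFN experts (dimensions chosen arbitrarily).
d_model, num_experts = 512, 4
ffns = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                      nn.Linear(4 * d_model, d_model)) for _ in range(num_experts)]
moe = TopKMoEFFN(ffns, d_model, top_k=2)
tokens = torch.randn(2, 77, d_model)                    # dummy token sequence
print(moe(tokens).shape)                                # torch.Size([2, 77, 512])
```

The point of top-k routing in such a layer is that per-token compute stays close to that of k dense FFNs regardless of how many fine-tuned experts are upcycled, which is the capacity/cost trade-off the abstract refers to.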
Anthology ID:
2025.emnlp-main.275
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5406–5419
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.275/
Cite (ACL):
Jihai Zhang, Xiaoye Qu, Tong Zhu, and Yu Cheng. 2025. CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5406–5419, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling (Zhang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.275.pdf
Checklist:
 2025.emnlp-main.275.checklist.pdf