mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou


Abstract
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets, and models are released in https://github.com/haon-chen/mmE5.
Anthology ID:
2025.findings-acl.433
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8254–8275
Language:
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.433/
DOI:
Bibkey:
Cite (ACL):
Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, and Zhicheng Dou. 2025. mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8254–8275, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data (Chen et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.433.pdf