Unified Embeddings for Multimodal Retrieval via Frozen LLMs
Ziyang Wang, Heba Elfardy, Markus Dreyer, Kevin Small, Mohit Bansal
Abstract
In this work, we present Unified Embeddings for Multimodal Retrieval (UniMuR), a simple but effective approach that embeds multimodal inputs and retrieves visual and textual outputs via frozen Large Language Models (LLMs). Specifically, UniMuR jointly retrieves multimodal outputs via a unified multimodal embedding and applies dual alignment training to account for both visual and textual semantics. Thus, unlike previous approaches, UniMuR significantly reduces the LLM’s modality bias towards generating text-only outputs. Meanwhile, the proposed unified multimodal embedding mitigates the inconsistency between visual and textual outputs and provides coherent multimodal outputs. Furthermore, benefiting from the joint training of visual and textual semantics, UniMuR also achieves strong image/text retrieval ability. Compared to existing approaches, UniMuR achieves better zero-shot multimodal response retrieval performance on MMDialog, improving the overall R@1 by 6.5% while boosting the image retrieval rate and improving cross-modal consistency on multimodal outputs. UniMuR also achieves 2.4% and 3.9% improvements on context-based image retrieval tasks on MMDialog and VisDial, respectively, when compared to previous approaches, validating its generalization ability across multiple tasks.
- Anthology ID: 2024.findings-eacl.105
- Volume: Findings of the Association for Computational Linguistics: EACL 2024
- Month: March
- Year: 2024
- Address: St. Julian’s, Malta
- Editors: Yvette Graham, Matthew Purver
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 1537–1547
- URL: https://aclanthology.org/2024.findings-eacl.105
- Cite (ACL): Ziyang Wang, Heba Elfardy, Markus Dreyer, Kevin Small, and Mohit Bansal. 2024. Unified Embeddings for Multimodal Retrieval via Frozen LLMs. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1537–1547, St. Julian’s, Malta. Association for Computational Linguistics.
- Cite (Informal): Unified Embeddings for Multimodal Retrieval via Frozen LLMs (Wang et al., Findings 2024)
- PDF: https://preview.aclanthology.org/nschneid-patch-3/2024.findings-eacl.105.pdf