e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

Haonan Chen; Sicheng Gao; Radu Timofte; Tetsuya Sakai; Zhicheng Dou (窦志成)

Abstract

Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/collections/Haon-Chen/e5-omni.
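
As a rough illustration of issue (i) and component (1), the sketch below divides each similarity score by a temperature chosen according to the candidate's modality before computing a standard contrastive (InfoNCE-style) loss. This is only one plausible reading of "modality-aware temperature calibration": the abstract does not specify the formulation, and the function names, the per-candidate assignment of temperatures, and all numeric values are assumptions made for illustration, not the authors' implementation.

```python
import math

def modality_scaled_logits(similarities, modalities, temperatures):
    """Divide each similarity by the temperature of its candidate's modality."""
    return [s / temperatures[m] for s, m in zip(similarities, modalities)]

def info_nce_loss(logits, positive_index):
    """Cross-entropy of the positive candidate against all candidates."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_norm - logits[positive_index]

# Hypothetical example: one query scored against four candidates.
similarities = [0.80, 0.60, 0.75, 0.20]          # assumed cosine similarities
modalities = ["text", "image", "audio", "text"]  # modality of each candidate
temperatures = {"text": 0.05, "image": 0.07, "audio": 0.10}  # illustrative values only

logits = modality_scaled_logits(similarities, modalities, temperatures)
loss = info_nce_loss(logits, positive_index=0)
```

With a single shared temperature, the same raw similarity would yield the same logit regardless of modality; per-modality temperatures let the scale of each modality's scores be adjusted separately, which is the kind of mismatch the abstract describes.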

Anthology ID: 2026.findings-acl.970
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19430–19443
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.970/
Cite (ACL): Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, and Zhicheng Dou. 2026. e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19430–19443, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings (Chen et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.970.pdf
Checklist: 2026.findings-acl.970.checklist.pdf
