I0T: Embedding Standardization Method Towards Zero Modality Gap

Na Min An, Eunki Kim, James Thorne, Hyunjung Shim


Abstract
Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the *modality gap*, which arises when image and text embeddings are projected onto disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to modality-specific characteristics that the image and text encoders each independently possess. Herein, we propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, I0Tpost, that reduces the modality gap approximately to zero, and (2) a trainable method, I0Tasync, that alleviates the modality gap by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality gap while preserving the original embedding representations of trained models with their parameters locked. In practice, I0Tpost can serve as an explainable automatic evaluation metric, offering an alternative to the widely used CLIPScore (CLIP-S). The code is available at https://github.com/xfactlab/I0T.
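The abstract does not spell out how the post-hoc standardization is computed, so the following is only a minimal sketch under stated assumptions: each modality's embeddings are centered by subtracting that modality's own mean vector and then re-normalized to unit length, and the modality gap is measured as the Euclidean distance between the two modality centroids. The helper names (`l2_normalize`, `modality_gap`, `standardize`) and the synthetic data are hypothetical stand-ins, not the authors' implementation or real CLIP outputs.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def modality_gap(img, txt):
    """Assumed gap measure: distance between the two modality centroids."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def standardize(emb):
    """Assumed post-hoc step: subtract the modality mean, then re-normalize rows."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    return l2_normalize(centered)


rng = np.random.default_rng(0)
dim = 64

# Synthetic stand-ins for encoder outputs: each "modality" receives its own
# large offset vector, which pushes the two embedding sets into separate
# regions of the unit sphere.
img = l2_normalize(rng.normal(size=(500, dim)) + 3.0 * rng.normal(size=dim))
txt = l2_normalize(rng.normal(size=(500, dim)) + 3.0 * rng.normal(size=dim))

gap_before = modality_gap(img, txt)
gap_after = modality_gap(standardize(img), standardize(txt))

print(f"gap before standardization: {gap_before:.3f}")
print(f"gap after standardization:  {gap_after:.3f}")
```

Because the step only transforms embeddings that have already been computed, it leaves the encoder parameters untouched, which matches the abstract's description of a method applied to trained models with locked parameters.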
Anthology ID:
2025.acl-long.1319
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
27182–27199
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1319/
Cite (ACL):
Na Min An, Eunki Kim, James Thorne, and Hyunjung Shim. 2025. I0T: Embedding Standardization Method Towards Zero Modality Gap. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27182–27199, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
I0T: Embedding Standardization Method Towards Zero Modality Gap (An et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1319.pdf