Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models

Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, Yifei Wang


Abstract
The success of vision-language models is primarily attributed to effective cross-modal alignment between vision and language. However, modality gaps persist even in well-aligned models and may be necessary for human perception, as evidenced by modality-specific phenomena such as visual texture and linguistic tone. These observations motivate us to computationally measure and leverage modality gaps to explore their utility in downstream applications. In this paper, we introduce the Modality Dominance Score (MDS), which attributes multimodal features to specific modalities by categorizing them as vision-dominant, language-dominant, or cross-modal. We then propose automatic interpretability metrics to evaluate these modality-specific features in a scalable manner. Finally, we demonstrate how the identified modality-specific features enable training-free probing and editing methods for understanding model perception across genders, generating adversarial examples, and controlling text-to-image generation. Combined with task-agnostic interpretability tools, our work provides a systematic framework for analyzing and efficiently controlling multimodal models.
Anthology ID:
2026.findings-acl.588
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12123–12138
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.588/
Cite (ACL):
Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, and Yifei Wang. 2026. Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 12123–12138, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models (Yan et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.588.pdf
Checklist:
2026.findings-acl.588.checklist.pdf