Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning

Nathaniel Krasner, Nicholas Lanuzo, Antonios Anastasopoulos


Abstract
Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image-caption datasets are easy to create without multilingual expertise, offering a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align text representations across languages, that languages unseen by the encoder in pretraining can be incorporated into this alignment post hoc, and that the resulting aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.
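The contrastive image-caption objective the abstract refers to can be illustrated with a CLIP-style symmetric InfoNCE loss: captions in any language are pulled toward their paired image, so captions of the same image in different languages are implicitly pulled toward each other. This is a minimal sketch under that assumption (function name, NumPy implementation, and temperature value are illustrative, not the paper's exact setup):

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    Illustrative CLIP-style objective (not the paper's exact implementation):
    each caption is contrasted against all images in the batch and vice versa,
    with matching pairs on the diagonal of the similarity matrix.
    """
    # L2-normalize so the dot product equals cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarities
    labels = np.arange(len(logits))          # correct pairs: the diagonal

    def xent(l):
        # row-wise cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)              # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2
```

With correctly paired embeddings the loss is near zero; shuffling the caption side (simulating mismatched pairs) drives it up, which is the signal that tunes the encoders toward a shared space.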
Anthology ID:
2025.acl-short.95
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1193–1199
URL:
https://preview.aclanthology.org/landing_page/2025.acl-short.95/
Cite (ACL):
Nathaniel Krasner, Nicholas Lanuzo, and Antonios Anastasopoulos. 2025. Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1193–1199, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning (Krasner et al., ACL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.acl-short.95.pdf