Cross-lingual and Multilingual CLIP

Fredrik Carlsson, Philipp Eisen, Faton Rekathati, Magnus Sahlgren


Abstract
The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough when OpenAI released CLIP. This model determines how well an English text corresponds to a given image with unprecedented accuracy. Because it was trained via a contrastive learning objective over a huge dataset of 400M image-caption pairs, it is not easily replicated, especially for low-resource languages. Capitalizing on the modularity of the CLIP architecture, we propose to use cross-lingual teacher learning to re-train the textual encoder for various non-English languages. Our method requires no image data and relies entirely on machine translation, which removes the need for data in the target language. We find that our method can efficiently train a new textual encoder at relatively low computational cost while still outperforming previous baselines on multilingual image-text retrieval.
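The teacher-learning idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a mean-squared-error objective between the frozen teacher's embedding of an English caption and the student's embedding of its machine translation; the abstract does not name the loss. The student here is a single linear layer, and the feature vectors and target embeddings are made-up stand-ins for encoded translated captions and teacher outputs. All function names are hypothetical.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_forward(weights, features):
    """Toy student text encoder: one linear layer (matrix-vector product)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def train_student(pairs, dim_in, dim_out, lr=0.1, epochs=200, seed=0):
    """Fit the student so that its output for a translated caption matches
    the frozen teacher embedding of the English source caption.

    pairs: list of (translated_features, teacher_embedding) tuples.
    No images are involved; only text pairs are needed.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)]
               for _ in range(dim_out)]
    for _ in range(epochs):
        for features, target in pairs:
            pred = student_forward(weights, features)
            for i in range(dim_out):
                # d(MSE)/d(pred_i) = 2 * (pred_i - target_i) / dim_out
                grad_out = 2.0 * (pred[i] - target[i]) / dim_out
                for j in range(dim_in):
                    weights[i][j] -= lr * grad_out * features[j]
    return weights


# Made-up data: one-hot "features" for three translated captions, paired
# with made-up teacher embeddings of the corresponding English captions.
pairs = [
    ([1.0, 0.0, 0.0], [0.5, -0.5]),
    ([0.0, 1.0, 0.0], [-1.0, 0.25]),
    ([0.0, 0.0, 1.0], [0.0, 1.0]),
]

weights = train_student(pairs, dim_in=3, dim_out=2)
final_loss = sum(mse(student_forward(weights, f), t) for f, t in pairs) / len(pairs)
print(f"final distillation loss: {final_loss:.3e}")
```

In a realistic setting the student would be a pretrained multilingual language model and the teacher would be CLIP's text encoder; the point of the sketch is only that the training signal comes from text pairs, so the image encoder is never touched.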
Anthology ID:
2022.lrec-1.739
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6848–6854
URL:
https://aclanthology.org/2022.lrec-1.739
Cite (ACL):
Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. 2022. Cross-lingual and Multilingual CLIP. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6848–6854, Marseille, France. European Language Resources Association.
Cite (Informal):
Cross-lingual and Multilingual CLIP (Carlsson et al., LREC 2022)
PDF:
https://preview.aclanthology.org/naacl24-info/2022.lrec-1.739.pdf
Code
FreddeFrallan/Multilingual-CLIP
Data
Flickr30kXTD10