AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Qinghong Yang, Ledell Wu


Abstract
CLIP (Contrastive Language–Image Pretraining) is an English multimodal representation model learned from a massive number of English text-image pairs, and it has achieved great success in various downstream tasks, including image classification, text-to-image retrieval, and image generation. When extending CLIP to other languages, the major obstacle is the lack of high-quality text-image pairs. In this work, we present AltCLIP, a simple and low-resource method to build a strong multilingual multimodal representation model. Instead of training a model from scratch on multilingual text-image pairs, we take the original CLIP model trained on English text-image pairs and alter its text encoder to a pre-trained multilingual text encoder (XLM-R). We then align the text and image representations with a two-stage training schema consisting of teacher learning and contrastive learning. Our method exploits the abundance of parallel text data and pre-trained multilingual language models. We present extensive experimental evaluations to demonstrate the effectiveness of the proposed method. Our model sets new state-of-the-art zero-shot performance on a wide range of tasks in multilingual multimodal benchmarks, including the ImageNet-CN/IT/JA/KO series, Flickr30k-CN, COCO-CN, Multi30k, and XTD. Further, our model outperforms the original CLIP model on zero-shot cross-modal retrieval, the Image Classification in the Wild (ICinW) tasks, and the CLIP Benchmark. We plan to open-source our code, pre-trained model weights, and an evaluation toolkit for multilingual multimodal tasks, to facilitate research on multilingual multimodal representation learning.
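The two-stage schema named in the abstract (teacher learning, then contrastive learning) can be sketched in code. The PyTorch sketch below is an illustration of the idea under stated assumptions, not the authors' released implementation: the encoder callables (`clip_text_teacher`, `xlmr_student`, `image_encoder`), the MSE distillation loss, and the temperature value are all hypothetical placeholders. Stage 1 distills the frozen CLIP text encoder's English embeddings into the XLM-R student over parallel sentence pairs; stage 2 refines the alignment with standard CLIP-style contrastive training on text-image pairs.

```python
# Minimal sketch of the two-stage alignment schema described in the abstract.
# All module names, hyperparameters, and loss choices are illustrative
# assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

def teacher_learning_step(clip_text_teacher, xlmr_student, en_texts, parallel_texts):
    """Stage 1 (teacher learning): distill CLIP's English text embeddings
    into the multilingual student over parallel text pairs."""
    with torch.no_grad():
        teacher_emb = clip_text_teacher(en_texts)   # frozen CLIP text encoder
    student_emb = xlmr_student(parallel_texts)      # XLM-R + projection head
    return F.mse_loss(student_emb, teacher_emb)     # match teacher embeddings

def contrastive_learning_step(image_encoder, xlmr_student, images, texts,
                              temperature=0.07):
    """Stage 2 (contrastive learning): CLIP-style InfoNCE on multilingual
    text-image pairs to refine the cross-modal alignment."""
    img = F.normalize(image_encoder(images), dim=-1)
    txt = F.normalize(xlmr_student(texts), dim=-1)
    logits = img @ txt.t() / temperature            # cosine-similarity logits
    labels = torch.arange(logits.size(0), device=logits.device)
    # symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```

A design point worth noting, consistent with the low-resource framing: stage 1 needs only parallel text (no images), so the expensive image-text contrastive stage can be short, since the student already lives in CLIP's embedding space when it begins.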
Anthology ID:
2023.findings-acl.552
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8666–8682
URL:
https://aclanthology.org/2023.findings-acl.552
DOI:
10.18653/v1/2023.findings-acl.552
Cite (ACL):
Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Qinghong Yang, and Ledell Wu. 2023. AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8666–8682, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities (Chen et al., Findings 2023)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2023.findings-acl.552.pdf