AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Qinghong Yang, Ledell Wu
Abstract
CLIP (Contrastive Language-Image Pre-training) is an English multimodal representation model learned from a massive amount of English text-image pairs; it has achieved great success in various downstream tasks, including image classification, text-to-image retrieval, and image generation. When extending CLIP to other languages, the major obstacle is the lack of good-quality text-image pairs. In this work, we present AltCLIP, a simple and low-resource method for building a strong multilingual multimodal representation model. Instead of training a model from scratch on multilingual text-image pairs, we take the original CLIP model trained on English text-image pairs and replace its text encoder with a pre-trained multilingual text encoder (XLM-R). We then align the text and image representations through a two-stage training schema consisting of teacher learning and contrastive learning. Our method exploits the abundance of parallel text data and pre-trained multilingual language models. We present extensive experimental evaluations to demonstrate the effectiveness of the proposed method. Our model sets new state-of-the-art zero-shot performance on a wide range of tasks in multilingual multimodal benchmarks, including the ImageNet-CN/IT/JA/KO series, Flickr30k-CN, COCO-CN, Multi30k, and XTD. Furthermore, our model outperforms the original CLIP model on zero-shot cross-modal retrieval, the Image Classification in the Wild (ICinW) tasks, and the CLIP Benchmark. We plan to open-source our code, pre-trained model weights, and evaluation toolkits for multilingual multimodal tasks to facilitate research on multilingual multimodal representation learning.
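The two-stage schema from the abstract can be made concrete. Below is a minimal sketch of the first stage (teacher learning), in which a frozen CLIP text encoder distills its English sentence embeddings into an XLM-R student over parallel text. It assumes the Hugging Face transformers library; the checkpoint names, the linear projection head, and the <s>-token pooling are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch of the stage-1 "teacher learning" described in the abstract,
# assuming the Hugging Face `transformers` library. Checkpoints, the linear
# projection, and <s>-token pooling are illustrative, not the authors' setup.
import torch
import torch.nn as nn
from transformers import (CLIPTextModelWithProjection, CLIPTokenizer,
                          XLMRobertaModel, XLMRobertaTokenizer)

# Frozen teacher: the original (English) CLIP text encoder.
teacher = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
teacher.eval().requires_grad_(False)

# Trainable student: multilingual XLM-R plus a head that projects its
# <s>-token representation into CLIP's text embedding space.
student = XLMRobertaModel.from_pretrained("xlm-roberta-base")
student_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
proj = nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

optim = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5)
mse = nn.MSELoss()

def distill_step(en_texts, xx_texts):
    """Pull the student's embedding of each non-English sentence toward the
    teacher's embedding of its English parallel (plain MSE distillation)."""
    with torch.no_grad():
        t_in = teacher_tok(en_texts, padding=True, truncation=True,
                           return_tensors="pt")
        target = teacher(**t_in).text_embeds              # (B, proj_dim)
    s_in = student_tok(xx_texts, padding=True, truncation=True,
                       return_tensors="pt")
    pred = proj(student(**s_in).last_hidden_state[:, 0])  # <s>-token pooling
    loss = mse(pred, target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# One toy parallel pair. Stage 2 would then fine-tune the aligned text
# encoder against CLIP's image encoder with a contrastive (CLIP-style) loss.
distill_step(["a photo of a cat"], ["一张猫的照片"])
```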
- Anthology ID: 2023.findings-acl.552
- Volume: Findings of the Association for Computational Linguistics: ACL 2023
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 8666–8682
- URL: https://aclanthology.org/2023.findings-acl.552
- DOI: 10.18653/v1/2023.findings-acl.552
- Cite (ACL): Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Qinghong Yang, and Ledell Wu. 2023. AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8666–8682, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities (Chen et al., Findings 2023)
- PDF: https://aclanthology.org/2023.findings-acl.552.pdf