AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Qinghong Yang, Ledell Wu
Abstract
CLIP (Contrastive Language–Image Pretraining) is an English multimodal representation model learned from a massive amount of English text-image pairs; it has achieved great success in various downstream tasks, including image classification, text-to-image retrieval, and image generation. When extending CLIP to other languages, the major obstacle is the lack of high-quality text-image pairs. In this work, we present AltCLIP, a simple and low-resource method for building a strong multilingual multimodal representation model. Instead of training a model from scratch on multilingual text-image pairs, we take the original CLIP model trained on English text-image pairs and replace its text encoder with a pre-trained multilingual text encoder (XLM-R). We then align text and image representations with a two-stage training scheme consisting of teacher learning and contrastive learning. Our method leverages the abundance of parallel text data and pre-trained multilingual language models. We present extensive experimental evaluations to demonstrate the effectiveness of the proposed method. Our model sets new state-of-the-art zero-shot performance on a wide range of tasks in multilingual multimodal benchmarks, including the ImageNet-CN/IT/JA/KO series, Flickr30k-CN, COCO-CN, Multi30k, and XTD. Furthermore, our model outperforms the original CLIP model on zero-shot cross-modal retrieval, the Image Classification in the Wild (ICinW) tasks, and the CLIP Benchmark. We plan to open-source our code, pre-trained model weights, and evaluation toolkits for multilingual multimodal tasks to facilitate research on multilingual multimodal representation learning. A minimal sketch of the first training stage is given below.
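The sketch below illustrates the "teacher learning" stage the abstract describes: distilling the frozen CLIP text encoder (teacher) into an XLM-R student on parallel text, so the student maps non-English text into CLIP's embedding space. The checkpoint names, the mean-pooling of the student, the linear projection, and the MSE objective are assumptions made for illustration, not the authors' released configuration.

```python
# Hypothetical sketch of AltCLIP Stage 1 (teacher learning), assuming
# HuggingFace transformers and PyTorch. Checkpoints, pooling, and loss
# are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
from transformers import (CLIPTextModel, CLIPTokenizer,
                          XLMRobertaModel, XLMRobertaTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen teacher: CLIP's original English text encoder.
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
teacher = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()

# Trainable student: multilingual XLM-R plus a linear projection into the
# teacher's embedding space (768-dim for the ViT-L/14 text tower).
student_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
student = XLMRobertaModel.from_pretrained("xlm-roberta-large").to(device)
proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size).to(device)

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5)
mse = nn.MSELoss()

def distill_step(en_texts, xx_texts):
    """One teacher-learning step on a batch of (English, translation) pairs."""
    with torch.no_grad():
        t_in = teacher_tok(en_texts, padding=True, truncation=True,
                           return_tensors="pt").to(device)
        # CLIP uses the EOS-token feature as the sentence embedding
        # (exposed as pooler_output on CLIPTextModel).
        t_emb = teacher(**t_in).pooler_output

    s_in = student_tok(xx_texts, padding=True, truncation=True,
                       return_tensors="pt").to(device)
    # Mean-pool XLM-R token states as the student sentence embedding
    # (an assumed pooling choice), then project into the teacher's space.
    s_hidden = student(**s_in).last_hidden_state
    mask = s_in["attention_mask"].unsqueeze(-1)
    s_emb = proj((s_hidden * mask).sum(1) / mask.sum(1))

    # Pull the student's multilingual embeddings toward the teacher's.
    loss = mse(s_emb, t_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (not shown): fine-tune the student against the frozen CLIP image
# encoder with the standard InfoNCE contrastive loss on a comparatively
# small set of multilingual text-image pairs.
```

After Stage 1 the student already shares a space with the CLIP image encoder, which is why the abstract's Stage 2 can make do with little text-image data: contrastive learning only has to refine an alignment that distillation has largely established.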
- Anthology ID:
- 2023.findings-acl.552
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8666–8682
- URL:
- https://aclanthology.org/2023.findings-acl.552
- DOI:
- 10.18653/v1/2023.findings-acl.552
- Cite (ACL):
- Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Qinghong Yang, and Ledell Wu. 2023. AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8666–8682, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities (Chen et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/add_acl24_videos/2023.findings-acl.552.pdf