RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
Chulun Zhou, Yunlong Liang, Fandong Meng, Jinan Xu, Jinsong Su, Jie Zhou
Abstract
Multilingual vision-language (V&L) pre-training has achieved remarkable progress in learning universal representations across different modalities and languages. In spite of recent success, there still remain challenges limiting further improvements of V&L pre-trained models in multilingual settings. Particularly, current V&L pre-training methods rely heavily on strictly-aligned multilingual image-text pairs generated from English-centric datasets through machine translation. However, the cost of collecting and translating such strictly-aligned datasets is usually unbearable. In this paper, we propose Regularized Contrastive Cross-lingual Cross-modal (RC3) pre-training, which further exploits more abundant weakly-aligned multilingual image-text pairs. Specifically, we design a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs according to textual relevance. Besides, existing V&L pre-training approaches mainly deal with visual inputs by either region-of-interest (ROI) features or patch embeddings. We flexibly integrate the two forms of visual features into our model for pre-training and downstream multi-modal tasks. Extensive experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method over competitive contrast models with strong zero-shot capability.- Anthology ID:
- 2023.findings-acl.746
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 11747–11762
- Language:
- URL:
- https://aclanthology.org/2023.findings-acl.746
- DOI:
- 10.18653/v1/2023.findings-acl.746
- Cite (ACL):
- Chulun Zhou, Yunlong Liang, Fandong Meng, Jinan Xu, Jinsong Su, and Jie Zhou. 2023. RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11747–11762, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training (Zhou et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2023.findings-acl.746.pdf