Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

Po-Yao Huang; Mandela Patrick; Junjie Hu; Graham Neubig; Florian Metze; Alexander G. Hauptmann

doi:10.18653/v1/2021.naacl-main.195

Abstract

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.

Anthology ID: 2021.naacl-main.195
Volume: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: June
Year: 2021
Address: Online
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 2443–2459
URL: https://aclanthology.org/2021.naacl-main.195
DOI: 10.18653/v1/2021.naacl-main.195
Cite (ACL): Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, and Alexander Hauptmann. 2021. Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2443–2459, Online. Association for Computational Linguistics.
Cite (Informal): Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models (Huang et al., NAACL 2021)
PDF: https://preview.aclanthology.org/ingestion-script-update/2021.naacl-main.195.pdf
Optional supplementary data: 2021.naacl-main.195.OptionalSupplementaryData.txt
Video: https://preview.aclanthology.org/ingestion-script-update/2021.naacl-main.195.mp4
Code: berniebear/Multi-HT100M
Data: HowTo100M, VATEX
