mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš


Abstract
Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition the LLMs to ‘understand’ the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM, using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data into 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and greatly outperforms strong English-only Vision-LLMs like LLaVA 1.5. We release our model, code, and training data at https://github.com/gregor-ge/mBLIP.
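To illustrate what the released model looks like in practice, below is a minimal inference sketch. It assumes the checkpoint name "Gregor-ge/mblip-mt0-xl" from the repository linked above and that mBLIP, since it reuses the BLIP-2 architecture, loads through the standard BLIP-2 classes in Hugging Face Transformers; treat both as assumptions rather than guaranteed details of the authors' release.

```python
# Minimal sketch: multilingual captioning with mBLIP via BLIP-2 classes.
# Checkpoint name and use of the BLIP-2 loader are assumptions (see above).
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

checkpoint = "Gregor-ge/mblip-mt0-xl"  # assumed released checkpoint name
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint).to(device)

# Any image works; this COCO validation image is a common demo choice.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The multilingual LLM lets one model follow prompts in many languages,
# e.g. asking for a German caption here.
prompt = "Beschreibe das Bild auf Deutsch."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```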
Anthology ID:
2024.alvr-1.2
Volume:
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, William Wang
Venues:
ALVR | WS
Publisher:
Association for Computational Linguistics
Pages:
7–25
URL:
https://aclanthology.org/2024.alvr-1.2
Cite (ACL):
Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2024. mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 7–25, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs (Geigle et al., ALVR-WS 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.alvr-1.2.pdf