MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching

Fabian David Schmidt; Florian Schneider; Chris Biemann; Goran Glavaš

Abstract

Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages, over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N’Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by a comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.

Anthology ID: 2025.findings-acl.838
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 16285–16312
URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.838/
Cite (ACL): Fabian David Schmidt, Florian Schneider, Chris Biemann, and Goran Glavaš. 2025. MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16285–16312, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching (Schmidt et al., Findings 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.838.pdf
