Grounding Multilingual Multimodal LLMs With Cultural Knowledge

Jean De Dieu Nyandwi, Yueqi Song, Simran Khanuja, Graham Neubig


Abstract
Multimodal Large Language Models (MLLMs) excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities and generate synthetic multilingual visual question answering (VQA) data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM, CulturalPangea, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of +5.0% without degrading results on mainstream vision–language tasks. Our findings show that a targeted, culturally grounded approach can substantially narrow the cultural gap in MLLMs and offers a practical path toward globally inclusive multimodal systems.
Anthology ID:
2025.emnlp-main.1232
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
24198–24242
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1232/
Cite (ACL):
Jean De Dieu Nyandwi, Yueqi Song, Simran Khanuja, and Graham Neubig. 2025. Grounding Multilingual Multimodal LLMs With Cultural Knowledge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24198–24242, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Grounding Multilingual Multimodal LLMs With Cultural Knowledge (Nyandwi et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1232.pdf
Checklist:
2025.emnlp-main.1232.checklist.pdf