Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text

Piyush Singh Pasi


Abstract
Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers, using English text alone, to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, M2M demonstrates robustness across datasets and tasks, extending to Audio-Text retrieval and Text-to-Image generation. We release [code and checkpoints](https://github.com/piyushsinghpasi/M2M) along with multilingual evaluation datasets: [MSCOCO Multilingual 30K](https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k), [AudioCaps Multilingual](https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual), and [Clotho Multilingual](https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual).
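The alignment idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the released M2M code: the embeddings are synthetic stand-ins for a multilingual text encoder and a multimodal encoder, a single linear map fitted by ordinary least squares stands in for the paper's learned linear layers, and the helper names `fit_linear_map` and `recall_at_k` are hypothetical. The abstract does not specify the training loss, the number of layers, or the encoders used.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Fit W, b by least squares so that src @ W + b approximates tgt.

    Assumption: a closed-form fit replaces gradient-trained linear layers.
    """
    # Append a constant column so the bias is solved jointly with the weights.
    src_aug = np.hstack([src, np.ones((src.shape[0], 1))])
    sol, *_ = np.linalg.lstsq(src_aug, tgt, rcond=None)
    return sol[:-1], sol[-1]


def recall_at_k(query, gallery, k):
    """Fraction of queries whose match (same row index) ranks in the top k
    gallery items by cosine similarity."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())


rng = np.random.default_rng(0)
n_train, n_test, d_text, d_mm = 512, 100, 32, 16

# Synthetic stand-ins: a hidden linear relation links the two spaces.
hidden_map = rng.normal(size=(d_text, d_mm))

# "Training" pairs: text-encoder embeddings of English captions and the
# multimodal text embeddings of the same captions (both simulated).
train_text = rng.normal(size=(n_train, d_text))
train_mm = train_text @ hidden_map + 0.01 * rng.normal(size=(n_train, d_mm))
W, b = fit_linear_map(train_text, train_mm)

# "Evaluation": project held-out text embeddings and retrieve simulated
# image embeddings that live in the multimodal space.
test_text = rng.normal(size=(n_test, d_text))
test_image = test_text @ hidden_map + 0.01 * rng.normal(size=(n_test, d_mm))
projected = test_text @ W + b
r10 = recall_at_k(projected, test_image, k=10)
print(f"Recall@10 on synthetic data: {r10:.3f}")
```

In this toy setup the held-out rows play the role of captions in any language. Zero-shot transfer of the kind reported above would additionally require that the multilingual encoder places translations of a caption close together, which the synthetic data simply assumes.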
Anthology ID:
2026.findings-eacl.143
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2750–2771
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.143/
Cite (ACL):
Piyush Singh Pasi. 2026. Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2750–2771, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text (Pasi, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.143.pdf
Checklist:
2026.findings-eacl.143.checklist.pdf