Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

Ece Takmaz, Lisa Bylinina, Jakub Dotlacil


Abstract
State-of-the-art vision-and-language models contain vast numbers of parameters and are trained on enormous datasets, far exceeding the amount of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge, which addresses this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. A recurring finding in the multimodal language model literature is that such models tend to underperform on language-only tasks. We therefore focus on maintaining language-only abilities in multimodal models. To this end, we experiment with model merging, fusing the parameters of multimodal models with those of language-only models via weighted linear interpolation. Our results corroborate the finding that multimodal models underperform on language-only benchmarks that focus on grammar, and show that merging with text-only models can alleviate this problem to some extent while maintaining multimodal performance.
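The merging operation described in the abstract, weighted linear interpolation of parameters, can be illustrated with a minimal PyTorch sketch. This assumes two checkpoints with identical architectures and parameter names; the file paths and the weight 0.5 are illustrative placeholders, not the authors' exact setup (in practice the interpolation weight would be tuned on validation data).

```python
import torch

def linear_merge(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Weighted linear interpolation of two state dicts:
    merged = alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter."""
    assert sd_a.keys() == sd_b.keys(), "models must share an architecture"
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Hypothetical usage: fuse a multimodal model's parameters with those of
# a text-only model trained with the same architecture and vocabulary.
sd_multimodal = torch.load("multimodal_lm.pt")  # illustrative path
sd_text_only = torch.load("text_only_lm.pt")    # illustrative path
merged = linear_merge(sd_multimodal, sd_text_only, alpha=0.5)
torch.save(merged, "merged_lm.pt")
```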
Anthology ID: 2025.babylm-main.5
Volume: Proceedings of the First BabyLM Workshop
Month: November
Year: 2025
Address: Suzhou, China
Editors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue: BabyLM
Publisher: Association for Computational Linguistics
Pages: 66–75
URL: https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.5/
Cite (ACL): Ece Takmaz, Lisa Bylinina, and Jakub Dotlacil. 2025. Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models. In Proceedings of the First BabyLM Workshop, pages 66–75, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models (Takmaz et al., BabyLM 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.5.pdf