Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data
Badr AlKhamissi, Yingtian Tang, Abdülkadir Gökce, Johannes Mehrer, Martin Schrimpf
Abstract
While today’s large language models exhibit impressive abilities in generating human-like text, they require massive amounts of data during training. Here, we take inspiration from human cognitive development to train models under limited-data conditions. Specifically, we present a self-synthesis approach that iterates through four phases: Phase 1 establishes fundamental language abilities by training the model from scratch on a small corpus. In Phase 2, language is associated with the visual environment by integrating the model with a vision encoder to generate descriptive captions from labeled images. In the “self-synthesis” Phase 3, the model generates captions for unlabeled images, which it then uses to further train its language component on a mix of synthetic and previously seen real-world text. This phase is meant to expand the model’s linguistic repertoire, similar to humans self-annotating new experiences. Finally, Phase 4 develops advanced cognitive skills by training the model on specific tasks such as visual question answering and reasoning. Our approach offers a proof of concept for training a multimodal model using a developmentally plausible amount of data.
- Anthology ID:
- 2024.conll-babylm.22
- Volume:
- The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
- Month:
- November
- Year:
- 2024
- Address:
- Miami, FL, USA
- Editors:
- Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
- Venues:
- CoNLL | BabyLM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 244–251
- URL:
- https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2024.conll-babylm.22/
- Cite (ACL):
- Badr AlKhamissi, Yingtian Tang, Abdülkadir Gökce, Johannes Mehrer, and Martin Schrimpf. 2024. Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 244–251, Miami, FL, USA. Association for Computational Linguistics.
- Cite (Informal):
- Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data (AlKhamissi et al., CoNLL-BabyLM 2024)
- PDF:
- https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2024.conll-babylm.22.pdf
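The four-phase pipeline described in the abstract can be sketched in miniature. This is an illustrative stand-in, not the authors’ implementation: the “model” is reduced to a growing training corpus, and captioning is a trivial template function, so only the data flow between phases is shown.

```python
# Hypothetical sketch of the four-phase self-synthesis pipeline.
# All function names, data, and the caption template are illustrative
# assumptions; the paper's actual models are neural networks.

def phase1_language(corpus):
    # Phase 1: acquire fundamental language abilities from a small corpus.
    return list(corpus)

def phase2_grounding(model, labeled_images):
    # Phase 2: associate language with vision by learning descriptive
    # captions for labeled images (here: labels stand in for images).
    captions = [f"a photo of a {label}" for label in labeled_images]
    return model + captions

def phase3_self_synthesis(model, unlabeled_images, real_text):
    # Phase 3: caption unlabeled images, then continue training the
    # language component on a mix of synthetic captions and real text.
    synthetic = [f"a photo of a {content}" for content in unlabeled_images]
    return model + synthetic + list(real_text)

def phase4_tasks(model, task_data):
    # Phase 4: develop advanced skills via tasks such as visual QA.
    return model + list(task_data)

training_data = phase1_language(["the cat sat"])
training_data = phase2_grounding(training_data, ["dog"])
training_data = phase3_self_synthesis(training_data, ["cat"], ["the dog ran"])
training_data = phase4_tasks(training_data, ["Q: what is shown? A: a cat"])
print(len(training_data))  # 5
```

The key step is Phase 3, where the model’s own outputs (captions for unlabeled images) are folded back into its training mix, expanding the corpus without additional human annotation.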