You are an LLM teaching a smaller model everything you know: Multi-task pretraining of language models with LLM-designed study plans

Wiktor Kamzela, Mateusz Lango, Ondrej Dusek


Abstract
This paper proposes multi-task pre-training of language models without any text corpora. The method leverages an existing Large Language Model (LLM) to generate a diverse corpus containing training data for 56 automatically designed tasks and uses generated labels to enhance the training signal. The method does not rely on hidden states or even output distributions of the teacher model, so it may be employed in scenarios where the teacher LLM is available only through an API. The conducted experiments show that models trained on the proposed synthetic corpora achieve competitive or superior performance compared to those trained on the same amount of human-written text.
Anthology ID:
2025.babylm-main.33
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Note:
Pages:
469–487
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.33/
Cite (ACL):
Wiktor Kamzela, Mateusz Lango, and Ondrej Dusek. 2025. You are an LLM teaching a smaller model everything you know: Multi-task pretraining of language models with LLM-designed study plans. In Proceedings of the First BabyLM Workshop, pages 469–487, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
You are an LLM teaching a smaller model everything you know: Multi-task pretraining of language models with LLM-designed study plans (Kamzela et al., BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.33.pdf