Wiktor Kamzela
2025
You are an LLM teaching a smaller model everything you know: Multi-task pretraining of language models with LLM-designed study plans
Wiktor Kamzela | Mateusz Lango | Ondrej Dusek
Proceedings of the First BabyLM Workshop
This paper proposes a multi-task pre-training method for language models that does not use any text corpora. The method leverages an existing Large Language Model (LLM) to generate a diverse corpus containing training data for 56 automatically designed tasks, and uses the generated labels to enhance the training signal. The method does not rely on hidden states or even output distributions of the teacher model, so it may be employed in scenarios where the teacher LLM is available only through an API. The conducted experiments show that models trained on the proposed synthetic corpora achieve competitive or superior performance compared to those trained on same-sized human-written texts.
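A minimal sketch of the kind of pipeline the abstract describes: a teacher LLM is queried through a plain text API to produce labeled examples for automatically designed tasks, and the collected text becomes the student's pretraining corpus. The task prompts, the model identifier, and the helper names below are illustrative assumptions, not the paper's actual study plan.

# Sketch: build a small synthetic multi-task corpus via an API-only teacher LLM.
# Assumes the openai Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical subset of automatically designed tasks (the paper uses 56).
TASKS = [
    "Write a short story for children and list its main characters.",
    "Write a sentence, then label the part of speech of every word.",
    "Write a short paragraph and a one-sentence summary of it.",
]

def generate_examples(task: str, n: int = 3) -> list[str]:
    """Ask the teacher model to produce n labeled training examples for one task."""
    examples = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder teacher; any API-only LLM would do
            messages=[{"role": "user", "content": task}],
        )
        examples.append(response.choices[0].message.content)
    return examples

if __name__ == "__main__":
    corpus = {task: generate_examples(task) for task in TASKS}
    # The generated inputs and labels are kept as plain text for student
    # pretraining; no teacher hidden states or output distributions are needed.
    for task, texts in corpus.items():
        print(task, "->", len(texts), "examples")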
SRS-Stories: Vocabulary-constrained multilingual story generation for language learning
Wiktor Kamzela | Mateusz Lango | Ondrej Dusek
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read, and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese, and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical and coherent, and provide better examples of word usage, than texts generated by the standard constrained beam search approach.
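To make the Spaced Repetition System role concrete, here is a minimal SM-2-style scheduling sketch: it tracks per-word review intervals and returns the words currently due, which could then be passed as lexical constraints to the story generator. The paper's actual SRS, grading scale, and constraint-passing mechanism may differ; all names below are hypothetical.

# Sketch: SM-2-style spaced repetition to select words due for review.
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class WordCard:
    word: str
    ease: float = 2.5          # SM-2 ease factor
    interval: int = 1          # days until the next review
    due: date = field(default_factory=date.today)

def review(card: WordCard, quality: int) -> None:
    """Update a card after a review graded 0-5, following the SM-2 update rule."""
    if quality < 3:
        card.interval = 1      # lapse: restart the interval
    else:
        card.interval = round(card.interval * card.ease)
        card.ease = max(1.3, card.ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    card.due = date.today() + timedelta(days=card.interval)

def due_words(cards: list[WordCard], today: date | None = None) -> list[str]:
    """Words whose review is due; these become lexical constraints for the story."""
    today = today or date.today()
    return [c.word for c in cards if c.due <= today]

# Hypothetical usage: the due words are inserted into the story-generation
# prompt as vocabulary that must appear in the generated text.
cards = [WordCard("apple"), WordCard("umbrella"), WordCard("library")]
print(due_words(cards))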