Influence-driven Curriculum Learning for Pre-training on Limited Data

Loris Schoenegger, Lukas Thoma, Terra Blevins, Benjamin Roth


Abstract
Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score that estimates the effect of individual training examples on the model's output. Models trained on our curricula outperform models trained in random order by over 10 percentage points on benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
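The core idea from the abstract, ordering training examples by a precomputed influence score rather than a human-centered difficulty metric, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the influence scores are assumed to be precomputed by some estimator, and the function name, score direction, and `easy_first` parameter are all hypothetical.

```python
def build_curriculum(examples, influence_scores, easy_first=True):
    """Order training examples by a precomputed influence score.

    Assumes lower influence corresponds to "easier" examples; whether
    to train low-to-high or high-to-low is a design choice, exposed
    here via the (hypothetical) easy_first flag.
    """
    # Pair each score with its example index, then sort by score.
    paired = sorted(zip(influence_scores, range(len(examples))),
                    reverse=not easy_first)
    return [examples[i] for _, i in paired]


# Toy usage with illustrative documents and made-up influence scores.
docs = ["doc_a", "doc_b", "doc_c"]
scores = [0.7, -0.2, 1.3]
curriculum = build_curriculum(docs, scores)  # low-influence docs first
```

The sketch only fixes the presentation order of a fixed dataset; in an actual pre-training run, the data loader would then iterate over this ordering instead of shuffling.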
Anthology ID:
2025.babylm-main.26
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
356–379
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.26/
Cite (ACL):
Loris Schoenegger, Lukas Thoma, Terra Blevins, and Benjamin Roth. 2025. Influence-driven Curriculum Learning for Pre-training on Limited Data. In Proceedings of the First BabyLM Workshop, pages 356–379, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Influence-driven Curriculum Learning for Pre-training on Limited Data (Schoenegger et al., BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.26.pdf