Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis
Abstract
Curriculum learning—organizing training data from easy to hard—has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, with over 200 models trained on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in early and mid-training phases, reducing training steps by 18–45% to reach baseline performance. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements up to 3.5%. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering—orthogonal to existing data selection methods—provides a practical mechanism for more efficient LLM pretraining.
- Anthology ID:
- 2026.eacl-long.271
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5776–5794
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.271/
- Cite (ACL):
- Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, and Michalis Vazirgiannis. 2026. Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5776–5794, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning (Zhang et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.271.pdf
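The abstract identifies compression ratio as one of the most effective difficulty signals for ordering pretraining data from easy to hard. The following is a minimal, illustrative sketch of that idea using zlib as the compressor; the function name and the sample documents are our own illustration, not from the paper, and the paper's actual metric implementation may differ.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed-to-raw byte length ratio.

    More redundant (intuitively "easier") text compresses better,
    yielding a lower ratio; information-dense text yields a higher one.
    """
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

docs = [
    "the cat sat on the mat. the cat sat on the mat.",
    "Quantum chromodynamics describes strong interactions via gluon exchange.",
]

# A vanilla easy-to-hard curriculum: sort documents by ascending difficulty.
curriculum = sorted(docs, key=compression_ratio)
```

The repetitive first document compresses far better than the dense second one, so it is scheduled first. In practice such a score would be computed once per document over the corpus, then combined with a pacing or interleaving schedule rather than a single global sort.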