Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

Neha Prakriya, Jui-Nan Yen, Cho-Jui Hsieh, Jason Cong

Abstract
We introduce an effective and scalable data selection technique to accelerate the pretraining of large language models (LLMs). Given the variation in quality and informativeness of web-scale corpora, we present the Learn-Focus-Review (LFR) paradigm, a dynamic training approach that adapts to the model's learning progress. Inspired by human learning techniques such as spaced repetition, LFR tracks the model's learning performance across data instances and prioritizes revisiting challenging and diverse regions of the dataset that are more prone to being forgotten, enabling better retention and more efficient learning. Through experiments spanning over 2200 GPU hours, we show that LFR significantly enhances data efficiency in pretraining while improving downstream performance across commonsense reasoning, question answering, problem-solving, language modeling, and translation tasks. LFR consistently achieves lower perplexity and higher accuracy while using just 5%–19% of the training tokens required by models trained on the full dataset. Notably, LFR matches the performance of industry-standard Pythia models with up to 2× the parameter count while requiring only 3.2% of the training tokens. Unlike prior work on data selection, LFR models are Chinchilla-optimal, demonstrating the effectiveness of our training methodology.
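
The abstract describes LFR only at a high level; below is a minimal, hypothetical sketch (in PyTorch) of what a Learn-Focus-Review style loop could look like. The per-sample-loss scoring rule, the focus/review fractions, and all names (lfr_pretrain, focus_fraction, review_fraction, num_focus_blocks) are illustrative assumptions drawn from the abstract, not the authors' reference implementation; it assumes a Hugging Face-style causal LM whose forward pass returns a .loss when given labels.

    # Hypothetical sketch of a Learn-Focus-Review style pretraining loop.
    # Scoring rule, fractions, and names are assumptions, not the paper's algorithm.
    import torch

    def lfr_pretrain(model, dataset, optimizer, device,
                     focus_fraction=0.25, review_fraction=0.05, num_focus_blocks=4):
        model.train()
        sample_loss = torch.zeros(len(dataset))  # running "difficulty" score per sample

        def train_on(indices, record_losses=False):
            # One pass over the given sample indices; optionally record each sample's loss.
            for i in indices.tolist():
                input_ids = dataset[i]["input_ids"].unsqueeze(0).to(device)
                out = model(input_ids=input_ids, labels=input_ids)
                if record_losses:
                    sample_loss[i] = out.loss.item()
                out.loss.backward()
                optimizer.step()
                optimizer.zero_grad()

        # Learn: one full pass over the corpus, tracking how well each sample is learned.
        train_on(torch.randperm(len(dataset)), record_losses=True)

        k = int(focus_fraction * len(dataset))
        for _ in range(num_focus_blocks):
            # Focus: revisit the currently hardest (highest-loss) samples, the ones
            # most prone to being forgotten.
            order = torch.argsort(sample_loss, descending=True)
            hard_ids, easy_ids = order[:k], order[k:]
            train_on(hard_ids[torch.randperm(len(hard_ids))], record_losses=True)

            # Review: a small spaced-repetition pass over the remaining data,
            # refreshing loss estimates so forgotten samples can re-enter the focus set.
            r = max(1, int(review_fraction * len(easy_ids)))
            review_ids = easy_ids[torch.randperm(len(easy_ids))[:r]]
            train_on(review_ids, record_losses=True)

In this sketch, losses are re-recorded during both the focus and review passes, so samples can move in and out of the focus set as the model learns or forgets them, mirroring the spaced-repetition intuition described in the abstract.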
Anthology ID:
2025.conll-1.18
Volume:
Proceedings of the 29th Conference on Computational Natural Language Learning
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Gemma Boleda, Michael Roth
Venues:
CoNLL | WS
Publisher:
Association for Computational Linguistics
Pages:
268–290
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.conll-1.18/
Cite (ACL):
Neha Prakriya, Jui-Nan Yen, Cho-Jui Hsieh, and Jason Cong. 2025. Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review. In Proceedings of the 29th Conference on Computational Natural Language Learning, pages 268–290, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review (Prakriya et al., CoNLL 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.conll-1.18.pdf