FORGETTER with forgetful hyperparameters and recurring sleeps can continue to learn beyond normal overfitting limits

Yamamoto Rui, Keiji Miura


Abstract
LLMs suffer from considerable computational costs in training. A more biologically plausible curriculum learning may help to decrease these costs. Here we propose the FORGETTER training algorithm, in which a model forgets its optimization variables after each sleep and the hyperparameters are set toward forgetting: rather large weight decay and learning rates, as well as small but optimized batch sizes. By limiting the minGemma model to an input length of 512 and thereby speeding up the development cycle, we compared the normal and FORGETTER learning algorithms across more than a thousand different models. Specifically, we found and exploited the “120-rule”: models with about 120 (query) heads in total outperform others, irrespective of the number of heads per layer. The improvement from the FORGETTER algorithm is far larger than that from optimizing the model structure. In particular, FORGETTER models can keep learning beyond the data size at which normal training overfits. The FORGETTER also works for CIFAR-10 image classification. These results suggest that forgetting can be beneficial for pretraining deep neural networks by avoiding overfitting.
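The sketch below illustrates how a FORGETTER-style training loop could look, assuming that "forgetting the variables for optimization after a sleep" means periodically discarding the optimizer's internal state (e.g. AdamW moments) while keeping the model weights. The function name, the sleep interval, the hyperparameter values, and the Hugging-Face-style `model(**batch).loss` interface are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch of a FORGETTER-style loop (assumptions noted in the lead-in).
import torch


def forgetter_train(model, data_loader, num_epochs=10, sleep_every=1000,
                    lr=3e-3, weight_decay=0.3):
    # "Forgetful" hyperparameters: deliberately large lr and weight decay;
    # the small (but tuned) batch size is configured in data_loader.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    step = 0
    for _ in range(num_epochs):
        for batch in data_loader:
            loss = model(**batch).loss      # assumes an HF-style forward pass
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step % sleep_every == 0:
                # "Sleep": drop the optimizer's accumulated state so the
                # optimization memory is forgotten, then keep training the
                # same weights with a fresh optimizer.
                optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                              weight_decay=weight_decay)
    return model
```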
Anthology ID:
2025.babylm-main.7
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
91–99
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.7/
Cite (ACL):
Yamamoto Rui and Keiji Miura. 2025. FORGETTER with forgetful hyperparameters and recurring sleeps can continue to learn beyond normal overfitting limits. In Proceedings of the First BabyLM Workshop, pages 91–99, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
FORGETTER with forgetful hyperparameters and recurring sleeps can continue to learn beyond normal overfitting limits (Rui & Miura, BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.7.pdf