@inproceedings{rui-miura-2025-forgetter,
    title = "{FORGETTER} with forgetful hyperparameters and recurring sleeps can continue to learn beyond normal overfitting limits",
    author = "Rui, Yamamoto  and
      Miura, Keiji",
    editor = "Charpentier, Lucas  and
      Choshen, Leshem  and
      Cotterell, Ryan  and
      Gul, Mustafa Omer  and
      Hu, Michael Y.  and
      Liu, Jing  and
      Jumelet, Jaap  and
      Linzen, Tal  and
      Mueller, Aaron  and
      Ross, Candace  and
      Shah, Raj Sanjay  and
      Warstadt, Alex  and
      Wilcox, Ethan Gotlieb  and
      Williams, Adina",
    booktitle = "Proceedings of the First BabyLM Workshop",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.7/",
    pages = "91--99",
    ISBN = "TODO",
    abstract = "LLMs suffer from considerable computational costs in training. A more biologically plausible curriculum learning may help to decrease these costs. Here we propose a FORGETTER training algorithm, in which a model forgets its optimization variables after a sleep and the hyperparameters are set toward forgetting memory: rather large weight decay and learning rates as well as small but optimized batch sizes. By limiting the minGemma model to a 512-token input length and thereby speeding up the development cycle, we compared the normal and FORGETTER learning algorithms across more than a thousand different models. Specifically, we found and utilized the ``120-rule'': models with about 120 (Query) heads in total, irrespective of the number of heads per layer, outperform others. The improvement from the FORGETTER algorithm is far bigger than that from optimizing the model structure. In particular, FORGETTER models can continue to learn beyond the data size at which normal training overfits. The FORGETTER also works for CIFAR10 image classification. These results suggest that forgetting can be beneficial for pretraining deep neural networks by avoiding overfitting."
}
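
As a rough illustration of the FORGETTER idea sketched in the abstract (periodically "sleeping" and forgetting the optimizer's internal variables while training with rather large weight decay and learning rates and small batches), here is a minimal PyTorch-style sketch. The interval, hyperparameter values, model, and data loader are illustrative assumptions, not the paper's actual settings or code.

```python
# Minimal sketch of the FORGETTER idea: after each "sleep", the optimizer's
# accumulated state (Adam moments, step counts) is forgotten by re-creating it,
# while training uses relatively large weight decay / learning rate and small
# batches (set via the data loader). All values here are placeholders.
import torch
import torch.nn as nn


def forgetter_train(model, data_loader, steps=10_000, sleep_every=1_000,
                    lr=1e-3, weight_decay=0.1):
    criterion = nn.CrossEntropyLoss()

    def fresh_optimizer():
        # Re-creating the optimizer discards its optimization variables --
        # the "forgetting" that follows a sleep.
        return torch.optim.AdamW(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)

    optimizer = fresh_optimizer()
    step = 0
    while step < steps:
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            step += 1
            if step % sleep_every == 0:
                optimizer = fresh_optimizer()  # "sleep": forget optimizer state
            if step >= steps:
                break
    return model
```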