Pretraining Language Models with LoRA and Artificial Languages

Nalin Kumar, Mateusz Lango, Ondrej Dusek


Abstract
Large language models (LLMs) require a substantial amount of training data, which contrasts with the data-efficient learning observed in humans. In our submission to the BabyLM Challenge, we address this disparity by proposing a parameter-efficient pretraining approach for language acquisition from limited data. Our approach involves initializing the model with token embeddings trained by a shallow model, followed by tuning the non-embedding parameters on non-linguistic data to introduce structural biases. We then freeze the resulting model and continue pretraining on the 10M-token BabyLM corpus through LoRA adapters. Experiments on small corpora demonstrate that our approach improves upon classic pretraining of the entire model.
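
For illustration only (this is not the authors' released code): a minimal sketch of the final stage described in the abstract, in which the base model is frozen and only LoRA adapters are trained, assuming the HuggingFace transformers and peft libraries. The model size, embedding file name, and target modules below are hypothetical placeholders.

    # Minimal sketch: frozen base model + LoRA adapters for continued pretraining.
    import numpy as np
    import torch
    from transformers import GPT2Config, GPT2LMHeadModel
    from peft import LoraConfig, get_peft_model

    # Small causal LM (in the paper's setup, already tuned on non-linguistic data).
    config = GPT2Config(vocab_size=16000, n_embd=256, n_layer=8, n_head=8)
    model = GPT2LMHeadModel(config)

    # Initialize token embeddings from a shallow model;
    # "shallow_embeddings.npy" is a hypothetical file of shape (vocab_size, n_embd).
    emb = np.load("shallow_embeddings.npy")
    model.get_input_embeddings().weight.data.copy_(torch.tensor(emb, dtype=torch.float32))

    # Freeze all base parameters; only the LoRA adapters will be trained on BabyLM data.
    for p in model.parameters():
        p.requires_grad = False

    lora_cfg = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["c_attn"],      # GPT-2 fused attention projection
        bias="none", task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # adapters are the only trainable weights

The resulting model can then be passed to a standard causal-language-modeling training loop (e.g., transformers' Trainer) over the 10M-token BabyLM corpus.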
Anthology ID:
2025.babylm-main.37
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
525–530
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.37/
Cite (ACL):
Nalin Kumar, Mateusz Lango, and Ondrej Dusek. 2025. Pretraining Language Models with LoRA and Artificial Languages. In Proceedings of the First BabyLM Workshop, pages 525–530, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Pretraining Language Models with LoRA and Artificial Languages (Kumar et al., BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.37.pdf