A Comparison of Elementary Baselines for BabyLM

Rareș Păpușoi, Sergiu Nisioi


Abstract
This paper explores multiple simple baselines for the BabyLM challenge, covering random models, elementary frequency-based predictions, n-gram language models, LSTMs trained with several tokenizers (BPE, Unigram, SuperBPE), and GPT-BERT, the winning architecture from the prior BabyLM edition. The evaluation focuses on the BLiMP and BLiMP-Supplement benchmarks. Our experiments show that Strict-Small can sometimes outperform Strict, that performance can be highly sensitive to tokenization, and that data efficiency is important. A simple word-frequency baseline scored unexpectedly high, which led to identifying an evaluation artifact in the pipeline: a system that returns identical logits for both sentences in a minimal pair can achieve maximal accuracy.
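The artifact described above can be illustrated with a minimal sketch, assuming a simplified minimal-pair scorer; the function names below are hypothetical and do not come from the authors' pipeline. If the accuracy check uses ">=" when comparing the scores of the grammatical and ungrammatical sentences, a degenerate system that assigns the same score to every sentence is credited with a correct answer on every pair.

    # Minimal sketch (hypothetical names, not the authors' pipeline) of how a
    # minimal-pair accuracy check can reward a degenerate model.

    def constant_score(sentence: str) -> float:
        """Degenerate 'model' that returns the same score for any input."""
        return 0.0

    def minimal_pair_accuracy(pairs, score_fn) -> float:
        """Fraction of pairs where the good sentence scores at least as high
        as the bad one. The '>=' is the artifact; a strict '>' (or explicit
        tie handling) would count ties as incorrect."""
        correct = sum(1 for good, bad in pairs if score_fn(good) >= score_fn(bad))
        return correct / len(pairs)

    pairs = [
        ("The cats sleep.", "The cats sleeps."),  # toy BLiMP-style minimal pair
        ("She has eaten.", "She has ate."),
    ]
    print(minimal_pair_accuracy(pairs, constant_score))  # 1.0 despite identical scores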
Anthology ID:
2025.babylm-main.16
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
218–225
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.16/
Cite (ACL):
Rareș Păpușoi and Sergiu Nisioi. 2025. A Comparison of Elementary Baselines for BabyLM. In Proceedings of the First BabyLM Workshop, pages 218–225, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
A Comparison of Elementary Baselines for BabyLM (Păpușoi & Nisioi, BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.16.pdf