A Morpheme-Aware Child-Inspired Language Model

Necva Bölücü, Burcu Can


Abstract
Most tokenization methods in language models rely on subword units that lack explicit linguistic correspondence. In this work, we investigate the impact of using morpheme-based tokens in a small language model, comparing them to the widely used frequency-based method, byte-pair encoding (BPE). We apply the morpheme-based tokenization method to both the 10-million-word and 100-million-word datasets from the BabyLM Challenge. Our results show that using a morphological tokenizer improves EWoK (basic world knowledge) performance by around 20% and entity tracking by around 40%, highlighting the value of morphological information in developing smaller language models. We also apply curriculum learning, in which morphological information is gradually introduced during training, mirroring the vocabulary-building stage in infants that precedes morphological processing. The results are consistent with previous research: curriculum learning yields slight improvements on some tasks but degrades performance on others.
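
The tokenizer itself is not reproduced on this page. As a rough illustration of the idea the abstract describes, the minimal Python sketch below splits words into stem and suffix tokens using a toy suffix lexicon. The SUFFIXES list, the morpheme_tokenize function, and the "##" bound-morpheme marker are hypothetical choices for this sketch, not the authors' implementation; a real morphological tokenizer would rely on a learned or rule-based morphological analyzer rather than this hand-written list, in contrast to BPE's frequency-driven merges.

# Hypothetical sketch: morpheme-aware tokenization vs. BPE-style subwords.
# The suffix list and segmentation rules below are toy assumptions for
# illustration; the paper's actual tokenizer may differ.
SUFFIXES = ["ness", "ing", "er", "ed", "s"]  # stripped greedily, in this order

def morpheme_tokenize(word: str) -> list[str]:
    """Strip known suffixes from the right, yielding stem + suffix tokens."""
    tokens: list[str] = []
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            # Require a remaining stem of at least 3 characters.
            if word.endswith(suf) and len(word) > len(suf) + 2:
                tokens.insert(0, "##" + suf)  # "##" marks a bound morpheme
                word = word[: -len(suf)]
                changed = True
                break
    tokens.insert(0, word)
    return tokens

if __name__ == "__main__":
    for w in ["teachers", "walking", "happiness"]:
        print(w, "->", morpheme_tokenize(w))
    # teachers  -> ['teach', '##er', '##s']
    # walking   -> ['walk', '##ing']
    # happiness -> ['happi', '##ness']  (toy rules: no orthographic repair)

A curriculum-learning variant, as the abstract describes, would start from whole-word tokens and gradually introduce such morpheme-level segmentations as training progresses.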
Anthology ID:
2025.babylm-main.21
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
279–287
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.21/
Cite (ACL):
Necva Bölücü and Burcu Can. 2025. A Morpheme-Aware Child-Inspired Language Model. In Proceedings of the First BabyLM Workshop, pages 279–287, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
A Morpheme-Aware Child-Inspired Language Model (Bölücü & Can, BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.21.pdf