@inproceedings{bolucu-can-2025-morpheme,
    title = "A Morpheme-Aware Child-Inspired Language Model",
    author = {B{\"o}l{\"u}c{\"u}, Necva  and
      Can, Burcu},
    editor = "Charpentier, Lucas  and
      Choshen, Leshem  and
      Cotterell, Ryan  and
      Gul, Mustafa Omer  and
      Hu, Michael Y.  and
      Liu, Jing  and
      Jumelet, Jaap  and
      Linzen, Tal  and
      Mueller, Aaron  and
      Ross, Candace  and
      Shah, Raj Sanjay  and
      Warstadt, Alex  and
      Wilcox, Ethan Gotlieb  and
      Williams, Adina",
    booktitle = "Proceedings of the First BabyLM Workshop",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.babylm-main.21/",
    pages = "279--287",
    ISBN = "TODO",
    abstract = "Most tokenization methods in language models rely on subword units that lack explicit linguistic correspondence. In this work, we investigate the impact of using morpheme-based tokens in a small language model, comparing them to the widely used frequency-based method, BPE. We apply the morpheme-based tokenization method to both 10-million and 100-million word datasets from the BabyLM Challenge. Our results show that using a morphological tokenizer improves EWoK (basic world knowledge) performance by around 20{\%} and entity tracking by around 40{\%}, highlighting the impact of morphological information in developing smaller language models. We also apply curriculum learning, in which morphological information is gradually introduced during training, mirroring the vocabulary-building stage in infants that precedes morphological processing. The results are consistent with previous research: curriculum learning yields slight improvements for some tasks, but performance degradation in others."
}

Markdown (Informal)

[A Morpheme-Aware Child-Inspired Language Model](https://aclanthology.org/2025.babylm-main.21/) (Bölücü & Can, BabyLM 2025)
ACL

Necva Bölücü and Burcu Can. 2025. A Morpheme-Aware Child-Inspired Language Model. In Proceedings of the First BabyLM Workshop, pages 279–287, Suzhou, China. Association for Computational Linguistics.