Linguistic Units as Tokens: Intrinsic and Extrinsic Evaluation with BabyLM
Achille Fusco, Maria Letizia Piccini Bianchessi, Tommaso Sgrizzi, Asya Zanollo, Cristiano Chesi
Abstract
Tokenization is often treated as a preprocessing step, yet in data-limited settings it directly shapes what a model can learn. We compare four segmentation strategies in the BabyLM Challenge: frequency-based BPE, morphology-aware MorPiece and ParadigmFinder, and syllable-based SylliTok. Evaluation combines two perspectives. First, an intrinsic test on the SIGMORPHON 2022 segmentation benchmark, adapted to English, measures how closely each tokenizer aligns with morpheme boundaries. Second, extrinsic tests train GPT-2 on the 10M BabyLM corpus and evaluate on the 2025 benchmark. No single tokenizer dominates. BPE remains strong on syntax-heavy tasks. ParadigmFinder excels in semantic composition and age-of-acquisition alignment. MorPiece shows advantages in discourse tracking. Morphology-aware tokenizers achieve the best intrinsic segmentation scores, and these gains translate into more robust generalisation in comprehension tasks. These results highlight tokenization as a core modeling decision, with direct consequences for compression, morphology, and the path to humanlike learning.
- Anthology ID: 2025.babylm-main.35
- Volume: Proceedings of the First BabyLM Workshop
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
- Venue: BabyLM
- Publisher: Association for Computational Linguistics
- Pages: 496–507
- URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.babylm-main.35/
- DOI: 10.18653/v1/2025.babylm-main.35
- Cite (ACL): Achille Fusco, Maria Letizia Piccini Bianchessi, Tommaso Sgrizzi, Asya Zanollo, and Cristiano Chesi. 2025. Linguistic Units as Tokens: Intrinsic and Extrinsic Evaluation with BabyLM. In Proceedings of the First BabyLM Workshop, pages 496–507, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Linguistic Units as Tokens: Intrinsic and Extrinsic Evaluation with BabyLM (Fusco et al., BabyLM 2025)
- PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.babylm-main.35.pdf