Exploring morphology-aware tokenization: A case study on Spanish language modeling

Alba Táboas García; Piotr Przybyła; Leo Wanner

doi:10.18653/v1/2025.emnlp-main.1552

Abstract

This paper investigates to what extent the integration of morphological information can improve subword tokenization and thus also language modeling performance. We focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), we explore a linguistically grounded approach: training a tokenizer on morphologically segmented data. To do so, we develop a semi-supervised segmentation model for Spanish, building gold-standard datasets to guide and evaluate it. We then use this tokenizer to pre-train a masked language model and assess its performance on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.

Anthology ID: 2025.emnlp-main.1552
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 30493–30506
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1552/
DOI: 10.18653/v1/2025.emnlp-main.1552
Cite (ACL): Alba Táboas García, Piotr Przybyła, and Leo Wanner. 2025. Exploring morphology-aware tokenization: A case study on Spanish language modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30493–30506, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Exploring morphology-aware tokenization: A case study on Spanish language modeling (García et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1552.pdf
Checklist: 2025.emnlp-main.1552.checklist.pdf
