Confounding Factors in Relating Model Performance to Morphology

Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux


Abstract
The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline necessities to reliably answer whether, and how, morphology relates to language modeling.
Anthology ID:
2025.emnlp-main.369
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7273–7298
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.369/
Cite (ACL):
Wessel Poelman, Thomas Bauwens, and Miryam de Lhoneux. 2025. Confounding Factors in Relating Model Performance to Morphology. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7273–7298, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Confounding Factors in Relating Model Performance to Morphology (Poelman et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.369.pdf
Checklist:
 2025.emnlp-main.369.checklist.pdf