The Impact of Tokenization Algorithms on Hungarian Language Model Performance
Mátyás Osváth, Máté Norbert Molnár, Roland Gunics, Noémi Ligeti-Nagy
Abstract
Tokenization is a crucial text processing step for preparing input for language models and can contribute to model performance, especially in morphologically rich languages. Currently, Byte Pair Encoding (BPE), WordPiece, and Unigram LM algorithms are predominantly used in language models, but their effects can vary in agglutinative languages. This work compares these tokenization algorithms across varying vocabulary sizes, as well as a modified Unigram LM variant with morphologically informed initialization, on the Hungarian subset of the OSCAR dataset. The evaluation is based on several metrics describing the inferred quality of the tokenizers and on the downstream performance of multiple BERT models on the HuLU benchmark. Results show that BPE produced the most compact and morphologically aligned subword representations, while the modified Unigram LM achieved the best overall downstream performance across tasks. However, differences between methods and vocabulary sizes were generally small and not statistically significant, with the exception of HuCoPA (a task within the HuLU benchmark), which showed sensitivity to both factors. These findings underscore that tokenizer choice and vocabulary design are critical determinants of language model efficiency and performance in morphologically rich languages.
- Anthology ID:
- 2026.lrec-main.199
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 2545–2556
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.199/
- Cite (ACL):
- Mátyás Osváth, Máté Norbert Molnár, Roland Gunics, and Noémi Ligeti-Nagy. 2026. The Impact of Tokenization Algorithms on Hungarian Language Model Performance. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 2545–2556, Palma de Mallorca, Spain. ELRA Language Resource Association.
- Cite (Informal):
- The Impact of Tokenization Algorithms on Hungarian Language Model Performance (Osváth et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.199.pdf
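To make the comparison in the abstract concrete, the greedy merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration, not the paper's training setup: the corpus (three inflected forms of the Hungarian word "ház", "house", with invented frequencies) and all function names are assumptions for demonstration only.

```python
# Minimal sketch of BPE merge learning (illustrative only; the paper
# trained full-scale tokenizers on the Hungarian OSCAR subset).
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties broken by insertion order
        vocab = apply_merge(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: the stem "ház" plus agglutinative suffixes (-ak, -ban).
corpus = {"h á z </w>": 5, "h á z a k </w>": 4, "h á z a k b a n </w>": 3}
merges, vocab = learn_bpe(corpus, 3)
print(merges)  # → [('h', 'á'), ('há', 'z'), ('ház', 'a')]
```

Note how the frequent stem "ház" is assembled before any suffix material, which is the intuition behind the abstract's finding that BPE yields compact, morphologically aligned subwords in an agglutinative language.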