The Impact of Tokenization Algorithms on Hungarian Language Model Performance
Mátyás Osváth, Máté Norbert Molnár, Roland Gunics, Noémi Ligeti-Nagy
Abstract
Tokenization is a crucial text processing step for preparing input for language models and can contribute to model performance, especially in morphologically rich languages. Currently, Byte Pair Encoding (BPE), WordPiece, and Unigram LM algorithms are predominantly used in language models, but their effects can vary in agglutinative languages. This work compares these tokenization algorithms across varying vocabulary sizes, as well as a modified Unigram LM variant with morphologically informed initialization, on the Hungarian subset of the OSCAR dataset. The evaluation is based on several metrics describing the inferred quality of the tokenizers and on the downstream performance of multiple BERT models on the HuLU benchmark. Results show that BPE produced the most compact and morphologically aligned subword representations, while the modified Unigram LM achieved the best overall downstream performance across tasks. However, differences between methods and vocabulary sizes were generally small and not statistically significant, with the exception of HuCoPA (a task within the HuLU benchmark), which showed sensitivity to both factors. These findings underscore that tokenizer choice and vocabulary design are critical determinants of language model efficiency and performance in morphologically rich languages.
- Anthology ID:
- 2026.lrec-main.199
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 2545–2556
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.199/
- Cite (ACL):
- Mátyás Osváth, Máté Norbert Molnár, Roland Gunics, and Noémi Ligeti-Nagy. 2026. The Impact of Tokenization Algorithms on Hungarian Language Model Performance. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 2545–2556, Palma de Mallorca, Spain. ELRA Language Resource Association.
- Cite (Informal):
- The Impact of Tokenization Algorithms on Hungarian Language Model Performance (Osváth et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.199.pdf
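To make the comparison in the abstract concrete, the greedy merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration, not the paper's training setup: the corpus (three inflected forms of the Hungarian word "ház", "house", with invented frequencies) and all function names are assumptions for demonstration only.

```python
# Minimal sketch of BPE merge learning (illustrative only; the paper
# trained full-scale tokenizers on the Hungarian OSCAR subset).
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties broken by insertion order
        vocab = apply_merge(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: the stem "ház" plus agglutinative suffixes (-ak, -ban).
corpus = {"h á z </w>": 5, "h á z a k </w>": 4, "h á z a k b a n </w>": 3}
merges, vocab = learn_bpe(corpus, 3)
print(merges)  # → [('h', 'á'), ('há', 'z'), ('ház', 'a')]
```

Note how the frequent stem "ház" is assembled before any suffix material, which is the intuition behind the abstract's finding that BPE yields compact, morphologically aligned subwords in an agglutinative language.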