Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks

Vani Kanjirangat, Tanja Samardzic, Ljiljana Dolamic, Fabio Rinaldi


Abstract
Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on model performance. This dialect gap has been attributed to various factors (e.g., data size, economic and social factors), whose impact, however, turns out to be inconsistent. In this work, we investigate factors that affect model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational bias in pre-trained multilingual models, with downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for script (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance on semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language-support claims of LLMs may mask deeper mismatches at the script or token level.
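To make the abstract's central quantity concrete: Tokenization Parity is commonly computed as the ratio of token counts a tokenizer produces for parallel text in a dialect versus a reference variety, so that values above 1 indicate over-segmentation of the dialect. The sketch below follows that common definition; the paper's exact formulation may differ, and `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer.

```python
# Minimal sketch of Tokenization Parity (TP) as a token-count ratio.
# ASSUMPTIONS: TP = |tokens(dialect)| / |tokens(reference)| over parallel
# text, and `toy_tokenize` is a stand-in for a real subword tokenizer.

def toy_tokenize(text: str) -> list[str]:
    # Stand-in segmenter: split on whitespace, then chop each word into
    # 3-character chunks, loosely mimicking subword segmentation.
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


def tokenization_parity(reference: str, dialect: str, tokenize=toy_tokenize) -> float:
    """TP > 1 means the dialect is segmented into more tokens than the reference."""
    return len(tokenize(dialect)) / len(tokenize(reference))


tp = tokenization_parity("standard sentence", "dialectal sentence variant")
print(round(tp, 2))  # → 1.5 with the toy tokenizer above
```

With a real multilingual tokenizer substituted for `toy_tokenize`, this ratio can be aggregated over a parallel corpus and correlated with downstream task scores, which is the kind of analysis the abstract describes.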
Anthology ID:
2025.emnlp-main.1224
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
24003–24021
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1224/
Cite (ACL):
Vani Kanjirangat, Tanja Samardzic, Ljiljana Dolamic, and Fabio Rinaldi. 2025. Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24003–24021, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks (Kanjirangat et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1224.pdf
Checklist:
 2025.emnlp-main.1224.checklist.pdf