Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks

Go Inoue; Bashar Alhafni; Nizar Habash; Timothy Baldwin

Abstract

Diacritics are orthographic marks added to letters to specify pronunciation, disambiguate lexical meanings, or indicate grammatical distinctions. Diacritics can significantly influence language processing tasks, especially in languages like Arabic, where diacritic usage varies widely across domains and contexts. While diacritics provide valuable linguistic information, their presence can increase subword fragmentation during tokenization, potentially degrading the performance of NLP models. In this paper, we systematically analyze the impact of diacritics on tokenization and benchmark task performance across major Large Language Models (LLMs). Our results demonstrate that while modern LLMs show robustness to the limited diacritics naturally found in texts, full diacritization leads to substantially increased token fragmentation and degraded performance, highlighting the need for careful handling of diacritics in the future development of Arabic LLMs.
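The fragmentation effect described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the abstract does not name the tokenizers or metrics used, so `strip_diacritics`, `fertility`, and the character-level `char_tokenize` stand-in are hypothetical names chosen for illustration. A real measurement would pass an LLM's subword tokenizer in place of the stand-in.

```python
import re
from typing import Callable, List

# Common Arabic diacritics (tashkeel): fathatan through sukun (U+064B-U+0652)
# plus the superscript (dagger) alif (U+0670). Assumption: this set is enough
# for illustration; the paper may define diacritics differently.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove the diacritic marks listed above, keeping base letters."""
    return ARABIC_DIACRITICS.sub("", text)


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(word)) for word in words) / len(words)


def char_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer (hypothetical): one token per code point."""
    return list(word)


# "kataba" fully diacritized: kaf, fatha, ta, fatha, ba, fatha.
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
undiacritized = strip_diacritics(diacritized)

with_marks = fertility(diacritized, char_tokenize)       # 6.0
without_marks = fertility(undiacritized, char_tokenize)  # 3.0
```

With this stand-in, every diacritic adds one token, so the fully diacritized word costs twice as many tokens as its bare form. Subword tokenizers behave less uniformly, which is why the comparison is worth running per model.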

Anthology ID: 2026.findings-eacl.22
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 426–442
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.22/
Cite (ACL): Go Inoue, Bashar Alhafni, Nizar Habash, and Timothy Baldwin. 2026. Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks. In Findings of the Association for Computational Linguistics: EACL 2026, pages 426–442, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks (Inoue et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.22.pdf
Checklist: 2026.findings-eacl.22.checklist.pdf
