Toward Dialect-Aware Safety Evaluation for Arabic Large Language Models

Wajdi Zaghouani


Abstract
Large language models (LLMs) are increasingly deployed with safety alignment mechanisms designed to prevent harmful outputs including hate speech, harassment, and unsafe instructions. However, existing safety evaluation frameworks remain heavily centered on English and standardized language varieties, creating a critical gap for languages characterized by extensive dialectal variation. Arabic provides a particularly important case: everyday communication across the Arab world occurs predominantly in regional dialects rather than Modern Standard Arabic (MSA), yet these dialects are systematically underrepresented in alignment training corpora and safety benchmarks. In this paper we introduce the Dialect Safety Gap, defined as systematic variation in LLM safety behavior across dialects of the same language. We argue that this phenomenon arises from the interaction between alignment training procedures and linguistic variation: safety alignment implicitly encodes normative patterns present in training datasets, and when dialectal forms diverge from those patterns, safety behavior degrades through lexical, morphological, and pragmatic mechanisms. We propose a formal framework grounded in algorithmic fairness that links dialect variation to alignment pipeline design, introduce both a binary DSG Score and a magnitude-aware Pairwise Dialect Inconsistency metric, and propose the Dialect-Aware Safety Evaluation Protocol (DASEP) as a practical evaluation framework. We demonstrate the feasibility of dialect-aware evaluation through a controlled, human-annotated prompt-probe experiment across five Arabic variety groups, revealing a structured gradient of safety degradation that correlates with linguistic distance from MSA.
Anthology ID:
2026.trustnlp-main.37
Volume:
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galstyan, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
503–514
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.37/
Cite (ACL):
Wajdi Zaghouani. 2026. Toward Dialect-Aware Safety Evaluation for Arabic Large Language Models. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 503–514, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Toward Dialect-Aware Safety Evaluation for Arabic Large Language Models (Zaghouani, TrustNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.37.pdf