Yassine Toughrai

2025

Modeling North African Dialects from Standard Languages
Yassine Toughrai | Kamel Smaïli | David Langlois
Proceedings of The Third Arabic Natural Language Processing Conference

Processing North African Arabic dialects presents significant challenges due to high lexical variability, frequent code-switching with French, and the use of both Arabic and Latin scripts. We address this with a phoneme-based normalization strategy that maps Arabic and French text into a simplified representation (Arabic rendered in Latin script), reflecting native reading patterns. Using this method, we pretrain BERT-based models on normalized Modern Standard Arabic and French only and evaluate them on Named Entity Recognition (NER) and text classification. Experiments show that normalized standard-language corpora yield competitive performance on North African dialect tasks; in zero-shot NER, Ar_20k surpasses dialect-pretrained baselines. Normalization improves vocabulary alignment, indicating that normalized standard corpora can suffice for developing dialect-supportive

ABDUL: A New Approach to Build Language Models for Dialects Using Formal Language Corpora Only
Yassine Toughrai | Kamel Smaïli | David Langlois
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)

Arabic dialects present major challenges for natural language processing (NLP) due to their diglossic nature, phonetic variability, and the scarcity of resources. To address this, we introduce a phoneme-like transcription approach that enables the training of robust language models for North African Dialects (NADs) using only formal language data, without the need for dialect-specific corpora. Our key insight is that Arabic dialects are highly phonetic, with NADs particularly influenced by European languages. This motivated us to develop a novel approach in which we convert Arabic script into a Latin-based representation, allowing our language model, ABDUL, to benefit from existing Latin-script corpora. Our method demonstrates strong performance in multi-label emotion classification and named entity recognition (NER) across various Arabic dialects. ABDUL achieves results comparable to or better than specialized and multilingual models such as DarijaBERT, DziriBERT, and mBERT. Notably, in the NER task, ABDUL outperforms mBERT by 5% in F1-score for Modern Standard Arabic (MSA), Moroccan, and Algerian Arabic, despite using a vocabulary four times smaller than mBERT.

Co-authors

David Langlois 2
Kamel Smaili 2

