Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models

Mostafa Saeed, Nizar Habash


Abstract
Lemmatization for dialectal Arabic poses many challenges due to the lack of orthographic standards and limited morphological analyzers. This work explores the effectiveness of Seq2Seq models for lemmatizing dialectal Arabic, both with and without the integration of morphological analyzers. We assess how well these models generalize across dialects and benefit from related varieties. Focusing on Egyptian, Gulf, and Levantine dialects with varying resource levels, our analysis highlights both the potential and limitations of data-driven approaches. The proposed method achieves significant gains over baselines, performing well in both low-resource and dialect-rich scenarios.
Anthology ID:
2025.arabicnlp-main.10
Volume:
Proceedings of The Third Arabic Natural Language Processing Conference
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Kareem Darwish, Ahmed Ali, Ibrahim Abu Farha, Samia Touileb, Imed Zitouni, Ahmed Abdelali, Sharefah Al-Ghamdi, Sakhar Alkhereyf, Wajdi Zaghouani, Salam Khalifa, Badr AlKhamissi, Rawan Almatham, Injy Hamed, Zaid Alyafeai, Areeb Alowisheq, Go Inoue, Khalil Mrini, Waad Alshammari
Venue:
ArabicNLP
Publisher:
Association for Computational Linguistics
Pages:
117–129
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.10/
Cite (ACL):
Mostafa Saeed and Nizar Habash. 2025. Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models. In Proceedings of The Third Arabic Natural Language Processing Conference, pages 117–129, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models (Saeed & Habash, ArabicNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.10.pdf