Beyond Cairo: Sa’idi Egyptian Arabic Literary Corpus Construction and Analysis

Mai Mohamed Eida, Nizar Habash


Abstract
Egyptian Arabic (EA) NLP resources have mainly focused on Cairene Egyptian Arabic (CEA), leaving sub-dialects such as Sa’idi Egyptian Arabic (SEA) underrepresented. This paper introduces the first SEA corpus: an open-source, 4-million-word literary dataset for a dialect spoken by ~30 million Egyptians. To validate that the corpus represents SEA, we analyze SEA-specific linguistic features drawn from dialectal surveys and confirm that they are more prevalent in our corpus than in existing EA datasets. Our findings offer insights into how SEA's morphology, phonology, and lexicon are represented orthographically, and we incorporate the CODA* guidelines for normalization.
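
The prevalence comparison described in the abstract can be illustrated with a minimal Python sketch. It assumes each corpus is already tokenized and that each SEA-specific feature has been reduced to a set of surface marker tokens. The function names, the per-million normalization, and the toy tokens are illustrative assumptions, not the authors' method or data.

```python
from collections import Counter


def rate_per_million(tokens, markers):
    """Occurrences of any marker token per million corpus tokens."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[m] for m in markers)
    return hits * 1_000_000 / len(tokens)


def prevalence_ratio(target_tokens, reference_tokens, markers):
    """Ratio of the marker rate in the target corpus to the reference corpus."""
    target = rate_per_million(target_tokens, markers)
    reference = rate_per_million(reference_tokens, markers)
    if reference == 0:
        # No occurrences in the reference corpus: the ratio is undefined
        # when the target rate is also zero, otherwise unbounded.
        return float("inf") if target > 0 else float("nan")
    return target / reference


# Toy placeholder tokens only; "x" stands in for a hypothetical feature marker.
markers = {"x"}
target_corpus = ["x", "a", "x", "b"]
reference_corpus = ["x", "a", "b", "c"]
ratio = prevalence_ratio(target_corpus, reference_corpus, markers)
```

A ratio above 1 would indicate that the marker set is more frequent in the target corpus than in the reference corpus. A real analysis would also need to handle orthographic variation before counting.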
Anthology ID:
2025.nlp4dh-1.26
Volume:
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Month:
May
Year:
2025
Address:
Albuquerque, USA
Editors:
Mika Hämäläinen, Emily Öhman, Yuri Bizzoni, So Miyagawa, Khalid Alnajjar
Venues:
NLP4DH | WS
Publisher:
Association for Computational Linguistics
Pages:
292–304
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.26/
Cite (ACL):
Mai Mohamed Eida and Nizar Habash. 2025. Beyond Cairo: Sa’idi Egyptian Arabic Literary Corpus Construction and Analysis. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, pages 292–304, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal):
Beyond Cairo: Sa’idi Egyptian Arabic Literary Corpus Construction and Analysis (Mohamed Eida & Habash, NLP4DH 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.26.pdf