How Well Do Tweets Represent Sub-Dialects of Egyptian Arabic?

Mai Mohamed Eida, Mayar Nassar, Jonathan Dunn


Abstract
How well does naturally occurring digital text, such as tweets, represent sub-dialects of Egyptian Arabic (EA)? This paper focuses on two EA sub-dialects: Cairene Egyptian Arabic (CEA) and Sa’idi Egyptian Arabic (SEA). We use morphological markers from ground-truth dialect surveys as a distance measure across four geo-referenced datasets. Results show that CEA markers are prevalent, as expected, in CEA geo-referenced tweets, while SEA markers are limited across SEA geo-referenced tweets. SEA tweets instead show a prevalence of CEA markers and higher usage of Modern Standard Arabic. We conclude that corpora intended to represent sub-dialects of EA do not accurately represent sub-dialects outside of the Cairene variety. This finding calls into question the validity of relying on tweets alone to represent dialectal differences.
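The marker-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker lists are hypothetical placeholders (the actual markers come from dialect surveys and are not reproduced here), the function name `marker_rates` is invented for illustration, and exact whitespace-token matching stands in for real morphological analysis.

```python
from collections import Counter

# Hypothetical placeholder marker lists (assumption): the paper's actual
# survey-derived markers are not reproduced here.
MARKERS = {
    "CEA": {"cea_marker_1", "cea_marker_2"},
    "SEA": {"sea_marker_1"},
}


def marker_rates(tweets, markers=MARKERS, per=1000):
    """Return occurrences of each marker set per `per` whitespace tokens.

    Simplification (assumption): a marker counts only when a whole token
    matches exactly; real morphological markers are often affixes.
    """
    counts = Counter()
    total_tokens = 0
    for tweet in tweets:
        tokens = tweet.split()
        total_tokens += len(tokens)
        for label, forms in markers.items():
            counts[label] += sum(1 for tok in tokens if tok in forms)
    if total_tokens == 0:
        # No text at all: report a zero rate for every marker set.
        return {label: 0.0 for label in markers}
    return {label: per * counts[label] / total_tokens for label in markers}


# Usage: compare marker rates for one (toy) geo-referenced dataset.
sample = ["cea_marker_1 x y sea_marker_1", "cea_marker_2 z"]
rates = marker_rates(sample)
```

Computing these rates separately for each geo-referenced dataset would let the CEA and SEA marker prevalence be compared across regions, which is the kind of contrast the abstract reports.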
Anthology ID:
2024.vardial-1.4
Volume:
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
41–55
URL:
https://aclanthology.org/2024.vardial-1.4
Cite (ACL):
Mai Mohamed Eida, Mayar Nassar, and Jonathan Dunn. 2024. How Well Do Tweets Represent Sub-Dialects of Egyptian Arabic? In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 41–55, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
How Well Do Tweets Represent Sub-Dialects of Egyptian Arabic? (Mohamed Eida et al., VarDial-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.vardial-1.4.pdf
Supplementary material:
 2024.vardial-1.4.SupplementaryMaterial.txt