Abstract
How well does naturally-occurring digital text, such as Tweets, represent sub-dialects of Egyptian Arabic (EA)? This paper focuses on two EA sub-dialects: Cairene Egyptian Arabic (CEA) and Sa’idi Egyptian Arabic (SEA). We use morphological markers from ground-truth dialect surveys as a distance measure across four geo-referenced datasets. Results show that CEA markers are prevalent as expected in CEA geo-referenced tweets, while SEA markers are limited across SEA geo-referenced tweets. SEA tweets instead show a prevalence of CEA markers and higher usage of Modern Standard Arabic. We conclude that corpora intended to represent sub-dialects of EA do not accurately represent sub-dialects outside of the Cairene variety. This finding calls into question the validity of relying on tweets alone to represent dialectal differences.
- Anthology ID:
- 2024.vardial-1.4
- Volume:
- Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
- Venues:
- VarDial | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 41–55
- URL:
- https://aclanthology.org/2024.vardial-1.4
- DOI:
- 10.18653/v1/2024.vardial-1.4
- Cite (ACL):
- Mai Mohamed Eida, Mayar Nassar, and Jonathan Dunn. 2024. How Well Do Tweets Represent Sub-Dialects of Egyptian Arabic? In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 41–55, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- How Well Do Tweets Represent Sub-Dialects of Egyptian Arabic? (Mohamed Eida et al., VarDial-WS 2024)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2024.vardial-1.4.pdf