Identifying Code-switching in Arabizi

Safaa Shehadi, Shuly Wintner


Abstract
We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
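The step from word-level predictions to sentence-level identification can be illustrated with a minimal sketch. This is not the authors' implementation: the tag names (`arabizi`, `english`, `french`, `arabic`), the function `summarize_sentence`, and the `min_arabizi` threshold are all hypothetical, chosen only to show one plausible way of aggregating per-word language IDs into flags for "contains Arabizi" and "contains code-switching".

```python
from collections import Counter

# Hypothetical tag names; the paper's actual tag set may differ.
ARABIZI = "arabizi"
LANGUAGE_TAGS = {"arabizi", "english", "french", "arabic"}


def summarize_sentence(tags, min_arabizi=1):
    """Aggregate word-level language tags into sentence-level flags.

    Tags outside LANGUAGE_TAGS (e.g. punctuation or unknown tokens)
    are ignored. `min_arabizi` is an assumed threshold, not a value
    taken from the paper.
    """
    counts = Counter(t for t in tags if t in LANGUAGE_TAGS)
    has_arabizi = counts[ARABIZI] >= min_arabizi
    other_langs = sorted(lang for lang in counts if lang != ARABIZI)
    return {
        "has_arabizi": has_arabizi,
        # Code-switching here means Arabizi plus at least one other language.
        "code_switched": has_arabizi and bool(other_langs),
        "switched_with": other_langs if has_arabizi else [],
    }


if __name__ == "__main__":
    # Word-level tags as a classifier might emit them for one sentence.
    print(summarize_sentence(["arabizi", "arabizi", "english", "other"]))
```

Under these assumptions, a sentence counts as code-switched only when it contains Arabizi together with at least one of the other tracked languages; a real pipeline would likely use the classifier's own tag inventory and a tuned threshold.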
Anthology ID:
2022.wanlp-1.18
Volume:
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
Venue:
WANLP
SIG:
SIGARAB
Publisher:
Association for Computational Linguistics
Pages:
194–204
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2022.wanlp-1.18/
DOI:
10.18653/v1/2022.wanlp-1.18
Cite (ACL):
Safaa Shehadi and Shuly Wintner. 2022. Identifying Code-switching in Arabizi. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 194–204, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Identifying Code-switching in Arabizi (Shehadi & Wintner, WANLP 2022)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2022.wanlp-1.18.pdf