Abstract
We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
- Anthology ID:
- 2022.wanlp-1.18
- Volume:
- Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates (Hybrid)
- Editors:
- Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
- Venue:
- WANLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 194–204
- URL:
- https://aclanthology.org/2022.wanlp-1.18
- DOI:
- 10.18653/v1/2022.wanlp-1.18
- Cite (ACL):
- Safaa Shehadi and Shuly Wintner. 2022. Identifying Code-switching in Arabizi. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 194–204, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- Identifying Code-switching in Arabizi (Shehadi & Wintner, WANLP 2022)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2022.wanlp-1.18.pdf