Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus

Mohamed Balabel; Injy Hamed; Slim Abdennadher; Ngoc Thang Vu; Özlem Çetinoğlu

Abstract

Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly because little data is available for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian Arabic-English code-switch speech data that is fully tokenized, lemmatized, and annotated with part-of-speech tags. Besides the corpus itself, we provide annotation guidelines that address the unique challenges of annotating code-switch data. We also address a further challenge: Egyptian Arabic orthography and grammar are not standardized.

Anthology ID: 2020.lrec-1.489
Volume: Proceedings of the 12th Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 3973–3977
Language: English
URL: https://aclanthology.org/2020.lrec-1.489
Cite (ACL): Mohamed Balabel, Injy Hamed, Slim Abdennadher, Ngoc Thang Vu, and Özlem Çetinoğlu. 2020. Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3973–3977, Marseille, France. European Language Resources Association.
Cite (Informal): Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus (Balabel et al., LREC 2020)
PDF: https://preview.aclanthology.org/update-css-js/2020.lrec-1.489.pdf