Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus
Mohamed Balabel, Injy Hamed, Slim Abdennadher, Ngoc Thang Vu, Özlem Çetinoğlu
Abstract
Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly due to the lack of available data needed for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian-Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags. Besides the corpus itself, we provide annotation guidelines to address the unique challenges of annotating code-switch data. Another challenge that we address is the fact that Egyptian Arabic orthography and grammar are not standardized.
- Anthology ID:
- 2020.lrec-1.489
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 3973–3977
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.489
- Cite (ACL):
- Mohamed Balabel, Injy Hamed, Slim Abdennadher, Ngoc Thang Vu, and Özlem Çetinoğlu. 2020. Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3973–3977, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus (Balabel et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2020.lrec-1.489.pdf