Abstract
Arabic dialects are the non-standard varieties of Arabic commonly spoken – and increasingly written on social media – across the Arab world. Arabic dialects do not have standard orthographies, a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are in parallel across the cities (2,000 sentences from each city). This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects. We present results on a bootstrapping technique we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.

- Anthology ID: 2020.lrec-1.508
- Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month: May
- Year: 2020
- Address: Marseille, France
- Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association
- Pages: 4130–4138
- Language: English
- URL: https://aclanthology.org/2020.lrec-1.508
- Cite (ACL): Fadhl Eryani, Nizar Habash, Houda Bouamor, and Salam Khalifa. 2020. A Spelling Correction Corpus for Multiple Arabic Dialects. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4130–4138, Marseille, France. European Language Resources Association.
- Cite (Informal): A Spelling Correction Corpus for Multiple Arabic Dialects (Eryani et al., LREC 2020)
- PDF: https://preview.aclanthology.org/landing_page/2020.lrec-1.508.pdf