Guidelines and Framework for a Large Scale Arabic Diacritized Corpus
Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, Kemal Oflazer
Abstract
This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.
- Anthology ID: L16-1577
- Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
- Month: May
- Year: 2016
- Address: Portorož, Slovenia
- Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association (ELRA)
- Pages: 3637–3643
- URL: https://preview.aclanthology.org/build-pipeline-with-new-library/L16-1577/
- Cite (ACL): Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, and Kemal Oflazer. 2016. Guidelines and Framework for a Large Scale Arabic Diacritized Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3637–3643, Portorož, Slovenia. European Language Resources Association (ELRA).
- Cite (Informal): Guidelines and Framework for a Large Scale Arabic Diacritized Corpus (Zaghouani et al., LREC 2016)
- PDF: https://preview.aclanthology.org/build-pipeline-with-new-library/L16-1577.pdf