Adel Mahmoud Wizani
2026
A Large and Balanced Multi-Domain Arabic Corpus Annotated for Morphology, Syntax, and Readability
Khalid N. Elmadani | Adel Mahmoud Wizani | Hanada Taha Thomure | Nizar Habash
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present BAREC-10M, an expanded version of the Balanced Arabic Readability Evaluation Corpus (BAREC). This new release extends the original 1M-word corpus to 10 million words and broadens its scope to include balanced multi-domain coverage annotated for morphology, syntax, and readability. The corpus integrates 45 sub-corpora drawn from diverse sources, including news, educational materials, literature, children’s texts, and religious discourse. Each text is labeled for domain, readership level, and genre, and automatically analyzed using state-of-the-art morphological and syntactic tools. To enhance coverage of underrepresented varieties, we manually digitized and included children’s materials, magazines, and curriculum-based content. The resulting dataset provides a balanced resource for studying Arabic linguistic variation across styles, audiences, and levels of complexity.
2024
Data Augmentation through Back-Translation for Stereotypes and Irony Detection
Tom Bourgeade | Silvia Casola | Adel Mahmoud Wizani | Cristina Bosco
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)
Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on monolingual data.