Hanada Taha-Thomure
Also published as: Hanada Taha, Hanada Taha Thomure
2026
A Large and Balanced Multi-Domain Arabic Corpus Annotated for Morphology, Syntax, and Readability
Khalid N. Elmadani | Adel Mahmoud Wizani | Hanada Taha Thomure | Nizar Habash
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present BAREC-10M, an expanded version of the Balanced Arabic Readability Evaluation Corpus (BAREC). This new release extends the original 1M-word corpus to 10 million words and broadens its scope to include balanced multi-domain coverage annotated for morphology, syntax, and readability. The corpus integrates 45 sub-corpora drawn from diverse sources, including news, educational materials, literature, children’s texts, and religious discourse. Each text is labeled for domain, readership level, and genre, and automatically analyzed using state-of-the-art morphological and syntactic tools. To enhance coverage of underrepresented varieties, we manually digitized and included children’s materials, magazines, and curriculum-based content. The resulting dataset provides a balanced resource for studying Arabic linguistic variation across styles, audiences, and levels of complexity.
2025
BAREC Demo: Resources and Tools for Sentence-level Arabic Readability Assessment
Kinda Altarbouch | Khalid N. Elmadani | Ossama Obeid | Hanada Taha | Nizar Habash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present BAREC Demo, a web-based system for fine-grained, sentence-level Arabic readability assessment. The demo is part of the Balanced Arabic Readability Evaluation Corpus (BAREC) project, which manually annotated 69,000 sentences (over one million words) from diverse genres and domains using a 19-level readability scale inspired by the Taha/Arabi21 framework, covering reading abilities from kindergarten to postgraduate levels. The project also developed models for automatic readability assessment. The demo provides two main functionalities for educators, content creators, language learners, and researchers: (1) a Search interface to explore the annotated dataset for text selection and resource development, and (2) an Analyze interface, which uses trained models to assign detailed readability labels to Arabic texts at the sentence level. The system and all of its resources are accessible at https://barec.camel-lab.com.
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation
Nizar Habash | Hanada Taha-Thomure | Khalid N. Elmadani | Zeina Zeino | Abdallah Abushmaes
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)
This paper presents the annotation guidelines of the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale resource for fine-grained sentence-level readability assessment in Arabic. BAREC includes 69,441 sentences (1M+ words) labeled across 19 levels, from kindergarten to postgraduate. Based on the Taha/Arabi21 framework, the guidelines were refined through iterative training with native Arabic-speaking educators. We highlight key linguistic, pedagogical, and cognitive factors in determining readability and report high inter-annotator agreement: Quadratic Weighted Kappa 81.8% (substantial/excellent agreement) in the last annotation phase. We also benchmark automatic readability models across multiple classification granularities (19-, 7-, 5-, and 3-level). The corpus and guidelines are publicly available: http://barec.camel-lab.com.
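The Quadratic Weighted Kappa agreement measure named in this abstract can be illustrated with a minimal sketch. It assumes two annotators who label the same sentences with integer levels from 1 to 19; the function name `quadratic_weighted_kappa` and the sample labels are hypothetical, and this is not the authors' evaluation code.

```python
from collections import Counter


def quadratic_weighted_kappa(labels_a, labels_b, num_levels):
    """Quadratic Weighted Kappa between two label sequences.

    Labels are assumed to be integers in 1..num_levels (an assumption
    made for this sketch, e.g. num_levels=19 for a 19-level scale).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    observed = Counter(zip(labels_a, labels_b))  # joint label counts
    hist_a = Counter(labels_a)                   # marginal counts, annotator A
    hist_b = Counter(labels_b)                   # marginal counts, annotator B
    max_dist = (num_levels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(1, num_levels + 1):
        for j in range(1, num_levels + 1):
            # Disagreement penalty grows with the squared level distance.
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed.get((i, j), 0) / n
            weighted_expected += (
                weight * (hist_a.get(i, 0) / n) * (hist_b.get(j, 0) / n)
            )

    if weighted_expected == 0.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical labels for five sentences on a 19-level scale.
annotator_1 = [3, 7, 12, 12, 18]
annotator_2 = [3, 8, 12, 11, 18]
print(quadratic_weighted_kappa(annotator_1, annotator_2, num_levels=19))
```

Because the penalty is quadratic in the level distance, two annotators who differ by one level are penalized far less than two who differ by ten, which suits an ordinal scale with many levels.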
BAREC Shared Task 2025 on Arabic Readability Assessment
Khalid N. Elmadani | Bashar Alhafni | Hanada Taha | Nizar Habash
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment
Khalid N. Elmadani | Nizar Habash | Hanada Taha-Thomure
Findings of the Association for Computational Linguistics: ACL 2025
This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, reflecting substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we make BAREC openly available, along with detailed annotation guidelines and benchmark results: http://barec.camel-lab.com.