2025
A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment
Khalid N. Elmadani | Nizar Habash | Hanada Taha-Thomure
Findings of the Association for Computational Linguistics: ACL 2025
This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, reflecting substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we make BAREC openly available, along with detailed annotation guidelines and benchmark results: http://barec.camel-lab.com.
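The agreement figure above is Quadratic Weighted Kappa (QWK), which penalizes disagreements by the squared distance between labels. A minimal pure-Python sketch of QWK over the 19 BAREC levels might look like this (function and variable names are illustrative, not taken from the BAREC release):

```python
# Minimal pure-Python Quadratic Weighted Kappa (QWK), the agreement
# metric reported above. Names are illustrative, not from BAREC's code.

def quadratic_weighted_kappa(ratings_a, ratings_b, n_levels):
    """QWK between two annotators' integer labels in [0, n_levels)."""
    n = len(ratings_a)
    # Observed confusion matrix and per-annotator marginal histograms.
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    hist_a = [0] * n_levels
    hist_b = [0] * n_levels
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1
        hist_a[a] += 1
        hist_b[b] += 1
    # Quadratic disagreement weights against the chance-expected matrix.
    denom = (n_levels - 1) ** 2
    num = exp = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / denom
            num += w * observed[i][j]
            exp += w * hist_a[i] * hist_b[j] / n
    return 1.0 - num / exp

# Toy example on the 19 BAREC levels (labels 0-18):
gold = [0, 5, 10, 18, 7]
pred = [1, 6, 10, 17, 9]
print(quadratic_weighted_kappa(gold, pred, 19))
```

Because the weights grow quadratically with label distance, confusing adjacent readability levels costs far less than confusing kindergarten with postgraduate, which suits ordinal scales like BAREC's.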
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation
Nizar Habash | Hanada Taha-Thomure | Khalid N. Elmadani | Zeina Zeino | Abdallah Abushmaes
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)
This paper presents the annotation guidelines of the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale resource for fine-grained sentence-level readability assessment in Arabic. BAREC includes 69,441 sentences (1M+ words) labeled across 19 levels, from kindergarten to postgraduate. Based on the Taha/Arabi21 framework, the guidelines were refined through iterative training with native Arabic-speaking educators. We highlight key linguistic, pedagogical, and cognitive factors in determining readability and report high inter-annotator agreement: Quadratic Weighted Kappa 81.8% (substantial/excellent agreement) in the last annotation phase. We also benchmark automatic readability models across multiple classification granularities (19-, 7-, 5-, and 3-level). The corpus and guidelines are publicly available: http://barec.camel-lab.com.
2024
Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting
Khalid N. Elmadani | Jan Buys
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Training neural models for translating between low-resource languages is challenging due to the scarcity of direct parallel data between such languages. Pivot-based neural machine translation (NMT) systems overcome data scarcity by including a high-resource pivot language in the process of translating between low-resource languages. We propose synthetic pivoting, a novel approach to pivot-based translation in which the pivot sentences are generated synthetically from both the source and target languages. Synthetic pivot sentences are generated through sequence-level knowledge distillation, with the aim of changing the structure of pivot sentences to be closer to that of the source or target languages, thereby reducing pivot translation complexity. We incorporate synthetic pivoting into two paradigms for pivoting: cascading and direct translation using synthetic source and target sentences. We find that the performance of pivot-based systems depends heavily on the quality of the NMT model used for sentence regeneration. Furthermore, training back-translation models on these sentences can make the models more robust to input-side noise. The results show that synthetic data generation improves pivot-based systems translating between low-resource Southern African languages by up to 5.6 BLEU points after fine-tuning.
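As a rough illustration of the cascading paradigm mentioned above (source → pivot → target), the sketch below chains two toy "models"; the dictionaries, tokens, and function names are invented placeholders standing in for real NMT systems, not code or data from the paper:

```python
# Toy sketch of cascaded pivot translation: source -> pivot -> target.
# The dict "models" and tokens are invented placeholders for NMT systems.

def translate(model, sentence):
    # Word-by-word lookup stands in for a full NMT decoding step.
    return " ".join(model.get(word, word) for word in sentence.split())

# Source->pivot and pivot->target toy lexicons.
src_to_piv = {"s1": "p1", "s2": "p2"}
piv_to_tgt = {"p1": "t1", "p2": "t2"}

def cascade(sentence):
    # Cascading: translate into the pivot language, then into the target.
    return translate(piv_to_tgt, translate(src_to_piv, sentence))

print(cascade("s1 s2"))  # -> "t1 t2"
```

In the cascading setup, errors in the first step propagate into the second; synthetic pivoting addresses this by regenerating pivot sentences so their structure is closer to the source or target language.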
2022
University of Cape Town’s WMT22 System: Multilingual Machine Translation for Southern African Languages
Khalid N. Elmadani | Francois Meyer | Jan Buys
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper describes the University of Cape Town’s submission to the constrained track of the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages. Our system is a single multilingual translation model that translates between English and 8 South / South East African languages, as well as between specific pairs of the African languages. We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training. Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.