Robert Joshua Reynolds

Also published as: Robert Reynolds

2026

Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment
Sarh Alzu’Bi | Robert Reynolds
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Arabic readability assessment is under-explored compared to English, and existing models are typically evaluated only within the training domain. We introduce the Jordanian School Textbook Corpus (JSTC), 82,512 segments from 240 textbooks spanning grades 1–12, and combine it with DARES to train XGBoost classifiers, fine-tuned CAMeLBERT transformers, and hybrid architectures evaluated both in-domain and on the BAREC out-of-domain benchmark. CAMeLBERT achieves strong in-domain performance (QWK = 0.830) but its cross-domain QWK collapses to 0.085, while XGBoost over 127 handcrafted linguistic features alone maintains the highest cross-domain QWK (0.240); adding [CLS] embeddings to those features actively harms transfer. Probing reveals that CAMeLBERT layers implicitly capture some linguistic features but higher-level signals overwhelm them, and Captum attribution identifies nouns and nominal particles such as al- as the most important tokens. The results argue for prioritizing linguistically-grounded features over contextual embeddings when cross-domain robustness is required.

2025

We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.