Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment

Sarh Alzu’Bi, Robert Reynolds


Abstract
Arabic readability assessment is under-explored compared to English, and existing models are typically evaluated only within the training domain. We introduce the Jordanian School Textbook Corpus (JSTC), 82,512 segments from 240 textbooks spanning grades 1–12, and combine it with DARES to train XGBoost classifiers, fine-tuned CAMeLBERT transformers, and hybrid architectures, all evaluated both in-domain and on the BAREC out-of-domain benchmark. CAMeLBERT achieves strong in-domain performance (QWK = 0.830), but its cross-domain QWK collapses to 0.085, while XGBoost over 127 handcrafted linguistic features alone maintains the highest cross-domain QWK (0.240); adding [CLS] embeddings to those features actively harms transfer. Probing reveals that CAMeLBERT layers implicitly capture some linguistic features but that higher-level signals overwhelm them, and Captum attribution identifies nouns and nominal particles such as al- as the most important tokens. The results argue for prioritizing linguistically grounded features over contextual embeddings when cross-domain robustness is required.
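The abstract reports results as QWK (quadratic weighted kappa), an agreement measure for ordinal labels that penalizes a prediction more the further it falls from the true level. As a rough illustration only, the sketch below computes QWK in plain Python under the standard definition; the function name, the 0-based integer level encoding, and the toy labels are assumptions made for this example and are not taken from the paper's code or data.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_levels):
    """Quadratic weighted kappa for ordinal labels encoded as 0..num_levels-1.

    Illustrative sketch of the standard definition; not the authors' code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")

    n = len(y_true)
    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(num_levels))
                  for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: grows with the squared distance between levels.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if true and predicted labels were independent.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0.0:
        # Every item has the same true and predicted level; returning 1.0 here
        # is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical toy labels on a three-level scale, not data from the paper.
gold = [0, 1, 2]
print(quadratic_weighted_kappa(gold, [0, 1, 2], num_levels=3))  # perfect agreement
print(quadratic_weighted_kappa(gold, [2, 1, 0], num_levels=3))  # reversed ordering
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is the scale on which the in-domain (0.830) and cross-domain (0.085, 0.240) figures above should be read.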
Anthology ID:
2026.bea-1.52
Volume:
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Ekaterina Kochmar, Bashar Alhafni, Stefano Bannò, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anais Tack, Victoria Yaneva, Zheng Yuan
Venues:
BEA | WS
Publisher:
Association for Computational Linguistics
Pages:
766–776
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.52/
Cite (ACL):
Sarh Alzu’Bi and Robert Reynolds. 2026. Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), pages 766–776, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment (Alzu’Bi & Reynolds, BEA 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.52.pdf