Is One Dataset Enough for Evaluation? Studying Generalizability of Automated Essay Scoring Models

Sohaila Eltanbouly, Marwan Sayed, Tamer Elsayed


Abstract
Automated Essay Scoring (AES) has made significant advances in writing assessment. Recently, cross-prompt AES has gained attention for its focus on generalizing to unseen prompts. Despite these advances, a critical question remains: how generalizable and robust are these models when applied to diverse datasets? This study assesses the generalizability of eight cross-prompt AES models across three different datasets. We employ two experimental setups: a within-dataset setup, where both training and testing occur on the same dataset, and a cross-dataset setup, which challenges the models by evaluating them on previously unseen datasets. The experimental results reveal significant performance inconsistencies, highlighting that relying on a single dataset is insufficient for building robust and generalizable AES systems.
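
The two evaluation setups described in the abstract can be illustrated with a minimal sketch. The dataset layout (a dict with "train", "test_essays", and "test_scores" entries), the train_model and model.predict interfaces, and the use of Quadratic Weighted Kappa as the agreement metric are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of the two evaluation protocols; names and data layout are assumed, not the authors' code.
    from itertools import permutations
    from sklearn.metrics import cohen_kappa_score

    def evaluate(model, essays, gold_scores):
        # Quadratic Weighted Kappa (QWK) is assumed as the agreement metric.
        predictions = [model.predict(essay) for essay in essays]
        return cohen_kappa_score(gold_scores, predictions, weights="quadratic")

    def within_dataset(datasets, train_model):
        # Within-dataset setup: train and test on disjoint splits of the same dataset.
        return {
            name: evaluate(train_model(data["train"]), data["test_essays"], data["test_scores"])
            for name, data in datasets.items()
        }

    def cross_dataset(datasets, train_model):
        # Cross-dataset setup: train on one dataset, test on a previously unseen one.
        results = {}
        for source, target in permutations(datasets, 2):
            model = train_model(datasets[source]["train"])
            results[(source, target)] = evaluate(
                model, datasets[target]["test_essays"], datasets[target]["test_scores"]
            )
        return results

Under this sketch, performance inconsistencies would show up as large gaps between the within-dataset scores and the corresponding cross-dataset scores for the same trained model.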
Anthology ID:
2026.lrec-main.29
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
431–440
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.29/
Cite (ACL):
Sohaila Eltanbouly, Marwan Sayed, and Tamer Elsayed. 2026. Is One Dataset Enough for Evaluation? Studying Generalizability of Automated Essay Scoring Models. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 431–440, Palma de Mallorca, Spain. ELRA Language Resources Association.
Cite (Informal):
Is One Dataset Enough for Evaluation? Studying Generalizability of Automated Essay Scoring Models (Eltanbouly et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.29.pdf