Quadratic Weighted Kappa Is Not Enough for Evaluating Automated Essay Scoring Models

Salam Albatarni, Tamer Elsayed


Abstract
Quadratic Weighted Kappa (QWK) has been the standard evaluation metric in Automated Essay Scoring (AES) research for over two decades. Despite repeated criticisms highlighting its limitations, the community has largely continued to rely on QWK without adopting alternative metrics. This study aims to encourage a shift toward more suitable evaluation practices by systematically examining QWK’s behavior under three key conditions: dataset size, class imbalance, and score range. Using both a publicly available AES dataset and carefully synthesized datasets, we demonstrate scenarios where QWK produces unstable or misleading results. Our findings highlight the need for more robust evaluation practices and point to alternative metrics, particularly variants of Gwet’s AC2, that offer greater reliability across a variety of conditions.
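To make the abstract's claim concrete, here is a minimal pure-Python sketch of the standard QWK computation (not the authors' code) that illustrates one failure mode under class imbalance: a degenerate constant predictor reaches 80% exact accuracy yet scores QWK = 0.

```python
def quadratic_weighted_kappa(y_true, y_pred):
    """Standard QWK: 1 - sum(W*O) / sum(W*E), with quadratic weights."""
    lo = min(min(y_true), min(y_pred))
    hi = max(max(y_true), max(y_pred))
    k = hi - lo + 1          # number of score categories observed
    n = len(y_true)
    # Observed agreement matrix O
    O = [[0.0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        O[t - lo][p - lo] += 1
    # Marginal score histograms (rows: true, cols: predicted)
    t_hist = [sum(row) for row in O]
    p_hist = [sum(O[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2) if k > 1 else 0.0
            num += w * O[i][j]
            den += w * t_hist[i] * p_hist[j] / n  # chance-expected matrix E
    return 1.0 - num / den if den else 1.0

# Perfect agreement -> QWK = 1.0
print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4]))
# Heavily imbalanced scores: constant predictor is 80% accurate, yet QWK = 0.0
print(quadratic_weighted_kappa([2, 2, 2, 2, 3], [2, 2, 2, 2, 2]))
```

The second call shows why QWK alone can mislead on skewed AES datasets: chance-corrected agreement collapses to zero whenever predictions carry no information beyond the majority score, regardless of raw accuracy.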
Anthology ID:
2026.lrec-main.348
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
4447–4456
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.348/
Cite (ACL):
Salam Albatarni and Tamer Elsayed. 2026. Quadratic Weighted Kappa Is Not Enough for Evaluating Automated Essay Scoring Models. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 4447–4456, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
Quadratic Weighted Kappa Is Not Enough for Evaluating Automated Essay Scoring Models (Albatarni & Elsayed, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.348.pdf