Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Abigail Gurin Schleifer; Moriah Ariely; Beata Beigman Klebanov; Asaf Salman; Giora Alexandron

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Abigail Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron

Abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs’ broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored.We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert.The results show that human–human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

Anthology ID:: 2026.bea-1.29
Volume:: Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Month:: July
Year:: 2026
Address:: San Diego, California, USA
Editors:: Ekaterina Kochmar, Bashar Alhafni, Stefano Bannò, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anais Tack, Victoria Yaneva, Zheng Yuan
Venues:: BEA | WS
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 432–442
Language:
URL:: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.29/
DOI:
Bibkey:
Cite (ACL):: Abigail Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, and Giora Alexandron. 2026. Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), pages 432–442, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):: Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation (Gurin Schleifer et al., BEA 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.29.pdf

PDF Cite Search Fix data