Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment

Ahmed Karim, Qiao Wang, Zheng Yuan


Abstract
Automated Essay Scoring (AES) systems now attain near-human agreement on some public benchmarks, yet real-world adoption—especially in high-stakes examinations—remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs enjoying formal coverage guarantees. Two open-source Large Language Models—Llama-3 8B and Qwen-2.5 3B—are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated to a 90% coverage target (a 10% risk level). Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, indicating that open-source, mid-sized LLMs can already support teacher-in-the-loop AES; we discuss scaling and broader user studies as future work.
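The wrapper the abstract describes can be illustrated with standard split conformal prediction for a K-way classifier. The sketch below is ours, not code from the paper: the function names and the synthetic calibration data are illustrative assumptions, showing only the generic recipe (score held-out calibration examples, take a finite-sample-corrected quantile, then emit every label whose nonconformity falls below that threshold).

```python
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration.

    cal_probs:  (n, K) softmax probabilities on held-out calibration essays.
    cal_labels: (n,) gold score classes.
    alpha:      target risk level; 1 - alpha is the coverage guarantee.
    Returns the nonconformity threshold q-hat.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level ceil((n+1)(1-alpha))/n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, qhat):
    """Every label whose nonconformity does not exceed the threshold."""
    return [np.where(1.0 - p <= qhat)[0].tolist() for p in test_probs]

# Tiny synthetic demo: a confident, well-calibrated 3-class scorer.
cal_probs = np.tile([0.9, 0.05, 0.05], (20, 1))
cal_labels = np.zeros(20, dtype=int)
qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(np.array([[0.9, 0.05, 0.05]]), qhat)
print(qhat, sets)  # → 0.1 [[0]]
```

Smaller prediction sets at the same coverage indicate a more decisive model, which is exactly the trade-off an uncertainty-aware metric such as UAcc is designed to reward.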
Anthology ID:
2025.emnlp-main.992
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
19642–19647
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.992/
Cite (ACL):
Ahmed Karim, Qiao Wang, and Zheng Yuan. 2025. Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19642–19647, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment (Karim et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.992.pdf
Checklist:
 2025.emnlp-main.992.checklist.pdf