@inproceedings{yoshida-2025-reasoning,
title = "Are the Reasoning Models Good at Automated Essay Scoring?",
author = "Yoshida, Lui",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.445/",
doi = "10.18653/v1/2025.findings-emnlp.445",
pages = "8388--8394",
ISBN = "979-8-89176-335-7",
abstract = "This study investigates the validity and reliability of reasoning models, specifically OpenAI{'}s o3-mini and o4-mini, in automated essay scoring (AES) tasks. We evaluated these models' performance on the TOEFL11 dataset by measuring agreement with expert ratings (validity) and consistency in repeated evaluations (reliability). Our findings reveal two key results: (1) the validity of reasoning models o3-mini and o4-mini is significantly lower than that of a non-reasoning model GPT-4o mini, and (2) the reliability of reasoning models cannot be considered high, with Intraclass Correlation Coefficients (ICC) of approximately 0.7 compared to GPT-4o mini{'}s 0.95. These results demonstrate that reasoning models, despite their excellent performance on many benchmarks, do not necessarily perform well on specific tasks such as AES. Additionally, we found that few-shot prompting significantly improves performance for reasoning models, while Chain of Thought (CoT) has less impact."
}