Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

Andreas Säuberli, Diego Frassinelli, Barbara Plank


Abstract
Knowing how test takers answer items in educational assessments is essential for developing tests, evaluating item quality, and improving test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior on test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs on two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics that are commonly used in educational assessment: classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can become more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans on reading comprehension items than on items from the other subjects. Overall, however, the correlations are not strong, indicating that LLMs should not be used to pilot educational assessments in a zero-shot setting.
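To make the calibration step concrete: temperature scaling divides a model's answer-option logits by a scalar T before the softmax, so T > 1 flattens an overconfident distribution. The sketch below is a minimal illustration, not the paper's implementation; the logit values, the human response proportions, and the use of total variation distance as the fitting objective are all hypothetical.

    import numpy as np

    def temperature_scale(logits, T):
        """Soften (T > 1) or sharpen (T < 1) a logit vector, then renormalize."""
        z = np.asarray(logits, dtype=float) / T
        z -= z.max()  # subtract the max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    # Hypothetical logits an LLM might assign to options A-D of one item.
    logits = [8.2, 1.1, 0.4, -0.5]

    # Hypothetical human response proportions for the same item.
    human = np.array([0.55, 0.25, 0.12, 0.08])

    # Pick T minimizing total variation distance to the human distribution.
    Ts = np.linspace(0.5, 10.0, 200)
    tvd = [0.5 * np.abs(temperature_scale(logits, T) - human).sum() for T in Ts]
    best_T = Ts[int(np.argmin(tvd))]
    print(f"best T = {best_T:.2f}, scaled p = {temperature_scale(logits, best_T)}")

At T = 1 nearly all probability mass falls on the first option; the fitted T spreads mass toward the distractors, which is the sense in which a calibrated response distribution can better match how a group of human test takers actually responds.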
Anthology ID:
2025.bea-1.21
Volume:
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Ekaterina Kochmar, Bashar Alhafni, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan
Venues:
BEA | WS
Publisher:
Association for Computational Linguistics
Pages:
266–278
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bea-1.21/
Cite (ACL):
Andreas Säuberli, Diego Frassinelli, and Barbara Plank. 2025. Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 266–278, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? (Säuberli et al., BEA 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bea-1.21.pdf