Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality

Jana Jung; Marlene Lutz; Indira Sen; Markus Strohmaier

Abstract

Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests – originally developed for humans – yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests on 17 LLMs for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent approaches (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluations of psychometric tests on LLMs are essential before interpreting their scores. Our findings also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.

Anthology ID: 2026.eacl-long.380
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 8143–8173
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.380/
Cite (ACL): Jana Jung, Marlene Lutz, Indira Sen, and Markus Strohmaier. 2026. Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8143–8173, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality (Jung et al., EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.380.pdf
