Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations

Xiaoxu Ma, Xiangbo Zhang, Zhenyu Weng


Abstract
Evaluating personality-related tendencies in Large Language Models (LLMs) helps characterize model behavior, compare models beyond task accuracy, and support responsible deployment in socially interactive settings. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation–based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model’s internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.
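As a rough illustration of the idea described in the abstract, the sketch below computes a persona direction as a difference of mean activations over contrastive prompts and then locates a neutral-prompt activation along that axis. The difference-of-means construction, the scaling (0 at the trait-negative anchor, 1 at the trait-positive anchor), and the function names `persona_vector` and `neutral_score` are assumptions made for illustration. The abstract does not give the paper's actual formulas, and the activations here are toy lists rather than real hidden states.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def persona_vector(pos_acts, neg_acts):
    """Assumed construction: difference of mean activations between
    trait-positive and trait-negative prompts. Returns the negative
    anchor and the direction pointing toward the positive anchor."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    return mu_neg, direction


def neutral_score(neutral_act, mu_neg, direction):
    """Assumed scoring: scalar position of the neutral activation
    projected onto the anchor axis (0 = negative anchor, 1 = positive)."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("contrastive prompts gave identical mean activations")
    offset = [h - n for h, n in zip(neutral_act, mu_neg)]
    return dot(offset, direction) / norm_sq


# Toy 2-dimensional "activations" (hypothetical values).
pos_acts = [[2.0, 0.0], [4.0, 0.0]]    # trait-eliciting prompts
neg_acts = [[-1.0, 0.0], [-1.0, 0.0]]  # trait-suppressing prompts
neutral_act = [1.0, 5.0]               # neutral prompt

mu_neg, direction = persona_vector(pos_acts, neg_acts)
score = neutral_score(neutral_act, mu_neg, direction)
print(score)  # 0.5
```

In this toy case the neutral activation sits halfway between the two anchors along the persona direction. Its component orthogonal to that direction does not affect the score.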
Anthology ID: 2026.findings-acl.803
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 16322–16340
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.803/
Cite (ACL): Xiaoxu Ma, Xiangbo Zhang, and Zhenyu Weng. 2026. Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations. In Findings of the Association for Computational Linguistics: ACL 2026, pages 16322–16340, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations (Ma et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.803.pdf
Checklist: 2026.findings-acl.803.checklist.pdf