Evaluating Language Model Character Traits

Francis Rhys Ward, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Beaney Colverd, Louis Alexander Thomson, Raymond Douglas, Patrik Bartak, Andrew Rowan


Abstract
Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, and coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs and helpful and harmless intentions. We infer belief and intent from LM behaviour, finding their consistency varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts but may be reflective in different contexts, meaning they mirror the LM’s behaviour in the preceding interaction. Our formalism enables us to describe LM behaviour precisely and without undue anthropomorphism.
Anthology ID:
2024.findings-emnlp.77
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1423–1443
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.77/
DOI:
10.18653/v1/2024.findings-emnlp.77
Cite (ACL):
Francis Rhys Ward, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Beaney Colverd, Louis Alexander Thomson, Raymond Douglas, Patrik Bartak, and Andrew Rowan. 2024. Evaluating Language Model Character Traits. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1423–1443, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Evaluating Language Model Character Traits (Ward et al., Findings 2024)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.77.pdf
Data:
 2024.findings-emnlp.77.data.zip