Ala N. Tak
2026
Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
Amin Banayeeanzade | Ala N. Tak | Fatemeh Bahrani | Anahita Bolourani | Leonardo Blas | Emilio Ferrara | Jonathan Gratch | Sai Praneeth Karimireddy
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The ability to control LLMs’ emulated emotional states and personality traits is an essential step in enabling rich, human-centered interactions in socially interactive settings. We introduce PsySET, a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains. Our study spans four models from different LLM families paired with various steering strategies, including prompting, fine-tuning, and representation engineering. Our results indicate that prompting is consistently effective but limited in intensity control, whereas vector injections achieve finer controllability while slightly reducing output quality. Moreover, we explore the trustworthiness of steered LLMs by assessing safety, truthfulness, fairness, and ethics, highlighting potential side effects and behavioral shifts. Notably, we observe idiosyncratic effects; for instance, even a positive emotion like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias. Meanwhile, anger predictably elevates toxicity yet strengthens leakage resistance. Our framework establishes the first holistic evaluation of emotion and personality steering, offering insights into its interpretability and reliability for socially interactive applications.
2025
Mechanistic Interpretability of Emotion Inference in Large Language Models
Ala N. Tak | Amin Banayeeanzade | Anahita Bolourani | Mina Kian | Robin Jia | Jonathan Gratch
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by investigating how autoregressive LLMs infer emotions, showing that emotion representations are functionally localized to specific regions in the model. Our evaluation includes diverse model families and sizes, and is supported by robustness checks. We then show that the identified representations are psychologically plausible by drawing on cognitive appraisal theory—a well-established psychological framework positing that emotions emerge from evaluations (appraisals) of environmental stimuli. By causally intervening on construed appraisal concepts, we steer the generation and show that the outputs align with theoretical and intuitive expectations. This work highlights a novel way to causally intervene and control emotion inference, potentially benefiting safety and alignment in sensitive affective domains.