Khushboo Singh
2025
Evaluation of LLMs-based Hidden States as Author Representations for Psychological Human-Centered NLP Tasks
Nikita Soni | Pranav Chitale | Khushboo Singh | Niranjan Balasubramanian | H. Schwartz
Findings of the Association for Computational Linguistics: NAACL 2025
Like most of NLP, models for human-centered NLP tasks (tasks attempting to assess author-level information) predominantly use representations derived from the hidden states of Transformer-based LLMs. However, which component of the LM is used for the representation varies widely. Moreover, there is a need for Human Language Models (HuLMs) that implicitly model the author and provide a user-level hidden state. Here, we systematically evaluate different ways of representing documents and users across different LM and HuLM architectures to predict task outcomes both as dynamically changing states and as averaged trait-like user-level attributes of valence, arousal, empathy, and distress. We find that representing documents as the average of their token hidden states generally performs best. Further, while the user-level hidden state itself is rarely the best representation, including it in the model strengthens the token and document embeddings used to derive document- and user-level representations, yielding the best overall performance.
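The pooling strategy the abstract highlights (representing a document as the average of its token hidden states) can be sketched as follows. This is a minimal illustration under assumed choices (roberta-base as the LM, final-layer states, mean pooling over non-padding tokens), not the authors' exact pipeline.

```python
# Minimal sketch (not the paper's exact setup): a document vector as the mean of
# the token hidden states from a Transformer LM. Model name and pooling details
# are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"  # assumption; the paper compares several LM/HuLM architectures
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def document_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final-layer token hidden states into one document vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape: (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

doc_vec = document_embedding("I felt calm and hopeful this morning.")
print(doc_vec.shape)  # torch.Size([1, 768]) for roberta-base
```

Averaging such document vectors across an author's texts is one natural way to build the user-level representations the abstract refers to, though the paper compares several alternatives.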
Systematic Evaluation of Auto-Encoding and Large Language Model Representations for Capturing Author States and Traits
Khushboo Singh | Vasudha Varadarajan | Adithya V Ganesan | August Håkan Nilsson | Nikita Soni | Syeda Mahwish | Pranav Chitale | Ryan L. Boyd | Lyle Ungar | Richard N Rosenthal | H. Schwartz
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) are increasingly used in human-centered applications, yet their ability to model diverse psychological constructs is not well understood. In this study, we systematically evaluate a range of Transformer-based LMs on predicting psychological variables across five major dimensions: affect, substance use, mental health, sociodemographics, and personality. Analyses span three temporal levels (short daily text responses about current affect, text aggregated over two weeks, and user-level text collected over two years), allowing us to examine how each model's strengths align with the underlying stability of different constructs. The findings show that mental health signals emerge as the most accurately predicted dimension (r = 0.6) across all temporal scales. At the daily scale, smaller models such as DeBERTa and HaRT often performed better, whereas at longer scales or with greater context, a larger model like Llama3-8B performed best. Aggregating text over the entire study period also yielded stronger correlations for outcomes such as age and income. Overall, these results underscore the importance of selecting model architectures and temporal aggregation techniques appropriate to the stability and nature of the target variable.
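As a rough illustration of the evaluation setup described above (user-level features aggregated over a time window, then scored by Pearson r against an outcome), here is a minimal sketch using synthetic data. The feature matrix, outcome, ridge regressor, and cross-validation scheme are placeholder assumptions, not the study's actual data or methods.

```python
# Minimal sketch, not the study's pipeline: aggregate per-user document embeddings
# over a period, predict a user-level outcome, and report Pearson r.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_users, dim = 200, 768

# Stand-in for per-user features: e.g., the mean of each user's document vectors
# aggregated over the study period (here, random vectors for illustration).
X = rng.normal(size=(n_users, dim))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=n_users)  # synthetic outcome, e.g. "age"

# Ridge regression with cross-validated predictions, scored by Pearson correlation.
preds = cross_val_predict(Ridge(alpha=1.0), X, y, cv=5)
r, _ = pearsonr(y, preds)
print(f"Pearson r = {r:.2f}")
```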
Co-authors
- Pranav Chitale 2
- H. Schwartz 2
- Nikita Soni 2
- Niranjan Balasubramanian 1
- Ryan L. Boyd 1