Oscar Kjell

2026

Language-based assessments have demonstrated high convergent validity with corresponding mental and physical health constructs, but they often fail to address discriminant validity: the measure’s ability to distinguish the target construct from related ones. This is a common phenomenon within the domain of mental health, as well as in comorbidity with physical health conditions. Identifying key features of individual dimensions of mental and physical health present in language can unlock new avenues of research for natural language processing and psychology. We propose two augmentations to the objective function of the Ridge model, deriving closed-form solutions compatible with Singular Value Decomposition-based solvers, to enforce discriminant validity against off-target constructs using Mean Squared Error (MSE) and Squared Cosine Similarity (SCS), both of which are widely used in contrastive learning. By varying the discrimination strength, we find that a decrease of 0.005 Pearson correlation points can result in an increase upwards of 0.132 Pearson correlation points in discriminant validity for mental and physical health constructs derived from self-reported questionnaires. We see similar improvements across multiple fundamental psychopathology dimensions simultaneously, increasing discriminant validity by 0.012, with stronger increases coming from noisier, less reliable constructs. Our contributions provide a theoretically grounded path towards improving confidence in language-based assessments in the clinical sector and improving the specificity of these assessments to various areas of health.
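
As an illustration of how an off-target penalty can still admit a closed-form, SVD-compatible solution, here is a worked sketch. The abstract does not state the augmented objective, so everything below is an assumption: the symbols $X$ (language features), $y$ (target construct scores), $z$ (off-target construct scores), $\lambda > 0$ (ridge penalty), and $\alpha \in [0, 1)$ (discrimination strength), as well as the choice to subtract the squared error to $z$, are hypothetical and may differ from the paper's formulation.

```latex
\begin{align}
J(w) &= \lVert Xw - y \rVert_2^2 + \lambda \lVert w \rVert_2^2
        - \alpha \lVert Xw - z \rVert_2^2 \\
\nabla J(w) = 0 \;&\Rightarrow\;
        \bigl((1-\alpha)\, X^\top X + \lambda I\bigr)\, w = X^\top (y - \alpha z) \\
X = U \Sigma V^\top \;&\Rightarrow\;
        \hat{w} = V \,\mathrm{diag}\!\left(\frac{\sigma_i}{(1-\alpha)\,\sigma_i^2 + \lambda}\right) U^\top (y - \alpha z)
\end{align}
```

Under these assumptions, setting $\alpha = 0$ recovers ordinary ridge regression, and for $\alpha < 1$ with $\lambda > 0$ the system matrix is positive definite, so the minimizer is unique.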

Evaluating Document-Tuned Transformer Representations for Person-level Mental Health Assessment
Aaron Marker | Oscar Kjell | Vasudha Varadarajan | H. Andrew Schwartz
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

Person-level psychological assessment requires aggregating meaning across many messages from the same individual, a task that document-level training objectives were not explicitly designed for. We present a systematic, empirical comparison between architecturally matched traditional (a) base-transformers and (b) document-tuned-transformers (further contrastively fine-tuned at the document-level, sometimes referred to as "sentence transformers") under otherwise identical conditions. Comparing layer-wise and overall performance across two longitudinal mental health and psychological datasets, we find that document-tuned models demonstrated a consistent improvement over base representations (increase in Pearson r of 13.4%, p=.015). Robustness analyses revealed that document-tuned models remained more accurate under perturbations including word deletion, synonym replacement, typo injection, and back translation. Further, hedged language (e.g., ‘usually’) was more characteristic of outcomes in document-tuned embeddings while abundance (e.g., ‘lot’) was more characteristic of base-transformers, suggesting document-tuned models may better capture uncertainty. These results suggest that representation choice impacts mental health prediction, with document-tuned models often being more adept.

While NLP typically treats documents as independent and unordered samples, in longitudinal studies this assumption rarely holds: documents are nested within authors and ordered in time, forming person-indexed, time-ordered behavioral sequences. Here, we demonstrate the need for and propose a longitudinal modeling and evaluation paradigm that consequently updates four parts of the NLP pipeline: (1) evaluation splits aligned to generalization over people (cross-sectional) and/or time (prospective); (2) accuracy metrics separating between-person differences from within-person dynamics; (3) sequence inputs to incorporate history by default; and (4) model internals that support different coarseness of latent state over histories (pooled summaries, explicit dynamics, or interaction-based models). We demonstrate the issues that arise from the traditional pipeline, and our proposed improvements, on a dataset of 17k daily diary transcripts paired with PTSD symptom severity from 238 participants, finding that traditional document-level evaluation can yield substantially different and sometimes reversed conclusions compared to our ecologically valid modeling and evaluation. We tie our results to a broader discussion motivating a shift from word-sequence evaluation toward behavior-sequence paradigms for NLP.
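
To make point (2) concrete, the sketch below shows one common way to separate between-person from within-person accuracy: correlate person-level means for the former, and person-mean-centered deviations for the latter. This is a minimal illustration under assumptions, not the paper's metric definitions; the function names `pearson` and `between_within` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def between_within(records):
    """Split accuracy into between-person and within-person correlations.

    records: dict mapping person id -> list of (predicted, observed) pairs,
    one pair per document (hypothetical input layout).
    """
    mean_pred, mean_obs = [], []  # one entry per person
    dev_pred, dev_obs = [], []    # one entry per document, person-centered
    for pairs in records.values():
        preds = [p for p, _ in pairs]
        obs = [o for _, o in pairs]
        mp, mo = sum(preds) / len(preds), sum(obs) / len(obs)
        mean_pred.append(mp)
        mean_obs.append(mo)
        dev_pred.extend(p - mp for p in preds)
        dev_obs.extend(o - mo for o in obs)
    return pearson(mean_pred, mean_obs), pearson(dev_pred, dev_obs)


# Toy data: predictions rank people correctly but get each person's
# day-to-day changes backwards.
toy = {
    "a": [(1, 3), (2, 2), (3, 1)],
    "b": [(11, 13), (12, 12), (13, 11)],
    "c": [(21, 23), (22, 22), (23, 21)],
}
pooled = pearson(
    [p for pairs in toy.values() for p, _ in pairs],
    [o for pairs in toy.values() for _, o in pairs],
)
between_r, within_r = between_within(toy)
print(pooled, between_r, within_r)
```

On this toy data the pooled document-level correlation and the between-person correlation are both strongly positive, while the within-person correlation is negative, which is the kind of reversal that pooled evaluation can hide.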

2025

Current speech encoding pipelines often rely on an additional text-based LM to get robust representations of human communication, even though SotA speech-to-text models often have an LM within. This work proposes an approach to improve the LM within an audio model such that the subsequent text LM is unnecessary. We introduce **WhiSPA** (**Whi**sper with **S**emantic and **P**sychological **A**lignment), which leverages a novel audio training objective: contrastive loss with a language model embedding as a teacher. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper’s latent space with semantic representations from a text autoencoder (SBERT) and lexically derived embeddings of basic psychological dimensions: emotion and personality. Over self-supervised affective tasks and downstream psychological tasks, WhiSPA surpasses current speech encoders, achieving an average error reduction of 73.4% and 83.8%, respectively. WhiSPA demonstrates that it is not always necessary to run a subsequent text LM on speech-to-text output in order to get a rich psychological representation of human communication.

2024

ALBA: Adaptive Language-Based Assessments for Mental Health
Vasudha Varadarajan | Sverker Sikström | Oscar Kjell | H. Andrew Schwartz
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Mental health issues differ widely among individuals, with varied signs and symptoms. Recently, language-based assessments have shown promise in capturing this diversity, but they require a substantial sample of words per person for accuracy. This work introduces the task of Adaptive Language-Based Assessment (ALBA), which involves adaptively ordering questions while also scoring an individual’s latent psychological trait using limited language responses to previous questions. To this end, we develop adaptive testing methods under two psychometric measurement theories: Classical Test Theory and Item Response Theory. We empirically evaluate ordering and scoring strategies, organizing into two new methods: a semi-supervised item response theory-based method (ALIRT) and a supervised Actor-Critic model. While we found both methods to improve over non-adaptive baselines, we found ALIRT to be the most accurate and scalable, achieving the highest accuracy with fewer questions (e.g., Pearson r ≈ 0.93 after only 3 questions as compared to typically needing at least 7 questions). In general, adaptive language-based assessments of depression and anxiety were able to utilize a smaller sample of language without compromising validity or incurring large computational costs.