Workshop on Computational Linguistics and Clinical Psychology (2026)


Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

Suicide is a major public health concern, underscoring the importance of understanding communication practices used in crisis intervention. Text-based crisis services are increasingly used, yet little is known about how counselors construct messages across encounters. One understudied feature of this setting is counselor text reuse, or the repeated use of identical or highly similar message content across different clients. Although reuse may support efficiency and consistency, it may also raise questions about how personalised responses are across counselors. This study provides a descriptive analysis of counselor text reuse in a large dataset of 4.7 million messages of real-time text-based crisis counseling conversations. Across 136 counselors, mean message similarity was very low, indicating little overall text reuse for most counselors. However, 103 counselors showed at least one instance of detected reuse, and a smaller subset demonstrated more consistent reuse. Reuse was also positively associated with counselor encounter volume across measures of reuse. Frequently reused longer passages primarily involved structured coping-oriented or psychoeducational content, such as coping strategies, grounding exercises, self-care tips, and relaxation techniques. The findings suggest that counselor text reuse increased with encounter volume but that average levels of reuse were low across counselors. They provide a foundation for future work examining associations with service delivery and client outcomes.
Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks reveals a critical divergence from standard NLP benchmarks: complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.
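The majority-voting adjudication mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each model returns a set of labels per transcript segment and that a label is kept when a strict majority of models predict it; the function name, label names, and example predictions are hypothetical and not taken from the paper.

```python
from collections import Counter


def majority_vote(label_sets, threshold=None):
    """Return the labels predicted by at least `threshold` models.

    label_sets: one iterable of labels per model (multi-label output).
    threshold:  minimum number of models that must agree; defaults to a
                strict majority (assumption, not the paper's stated rule).
    """
    n_models = len(label_sets)
    if threshold is None:
        threshold = n_models // 2 + 1
    # Count each label at most once per model.
    counts = Counter(label for labels in label_sets for label in set(labels))
    return {label for label, count in counts.items() if count >= threshold}


# Hypothetical outputs from three models for one transcript segment.
predictions = [
    {"persecutory", "fear"},
    {"persecutory", "avoidance"},
    {"persecutory", "fear", "grandiose"},
]
print(sorted(majority_vote(predictions)))  # ['fear', 'persecutory']
```

With three models the default threshold is two, so labels proposed by a single model are dropped without any exchange between agents.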
When large language models simulate patients or clients, they tend to produce cooperative dialogue, premature emotional insight, and rapid resolution. These defaults undermine clinical training, where the pedagogical value lies in sustained difficulty. We describe Clinical Prompt Engineering (CPE), a methodology developed by a multidisciplinary team of clinician-researchers and prompt engineering experts within the [ProjectName] project. CPE encodes clinical knowledge directly into prompt design: each simulated character is constructed through layered psychological profiles, explicit contingency rules linking interactional events to internal states, and enforced non-linear emotional trajectories that resist the model’s pull toward resolution. The methodology has been applied across several clinical training simulations involving over 300 participants in formal studies and iterative pilot phases. Each simulated character is embedded within a multi-agent training environment that provides real-time reflective guidance during the interaction and structured, clinically informed feedback afterward. We illustrate the approach through Talking with Lia, a Hebrew-language simulation in which parents practice responding to a seven-year-old child during repeated missile alerts and forced sheltering. The simulation was deployed within the first week of an acute security crisis in Israel in Winter 2026. Of 132 sessions initiated organically through professional networks, 42 were completed; qualitative feedback emphasized the simulation’s difficulty as pedagogically meaningful. Because CPE operates at the level of prompt design, it can be developed by clinician-researcher teams and adapted to new populations, developmental stages, and crisis contexts, potentially extending access to expert-informed training beyond the settings where such expertise is typically available. 
Where much computational work in clinical psychology has focused on classifying mental health states from text, CPE addresses a complementary task: whether clinicians can respond effectively to those states as they shift in real time. The next step is testing whether the skills practiced in simulation transfer to real interactions.
Language-based assessments have demonstrated high convergent validity with corresponding mental and physical health constructs, but often fail to address discriminant validity: the measure’s ability to distinguish the target construct from related ones. Overlap between related constructs is common within the domain of mental health, as is comorbidity with physical health conditions. Identifying key features of individual dimensions of mental and physical health present in language can unlock new avenues of research for natural language processing and psychology. We propose two augmentations to the objective function of the Ridge model, deriving closed-form solutions compatible with Singular Value Decomposition-based solvers, to enforce discriminant validity of off-target constructs using Mean Squared Error (MSE) and Squared Cosine Similarity (SCS), both having widespread use in contrastive learning. By varying the discrimination strength, we find that a decrease of 0.005 Pearson correlation points can result in an increase upwards of 0.132 Pearson correlation points in discriminant validity for mental and physical health constructs derived from self-reported questionnaires. We see similar improvements across multiple fundamental psychopathology dimensions simultaneously, increasing discriminant validity by 0.012, with stronger increases coming from noisier, less reliable constructs. Our contributions provide a theoretically grounded path towards improving confidence in language-based assessments in the clinical sector, improving specificity of said assessments to various areas of health.
We present MHRoBERT (Multistream HEAT over Recurrence over BERT), a hierarchical transformer architecture for longitudinal mental health monitoring that models self- and mutual excitation patterns in linguistic and temporal data across multivariate event streams relating to an individual’s mental health. To supply the model with complementary perspectives on each post, we apply a Large Language Model (LLM) based annotation to extract three streams from social media posts: emotional states, personal life events, and mental health symptoms. A central finding is that multi-task learning with these automatically-generated stream labels provides substantial, consistent improvements across all model architectures evaluated. Multistream information further consistently benefits simpler models not explicitly designed to exploit it: LLM baselines incorporating stream annotations improve macro F1 by 12.6% over text-only prompting. These results have direct implications for the CLPsych Shared Task on Moments of Change detection: multistream auxiliary supervision yields consistent, substantial gains regardless of architecture, suggesting it is a simple and portable strategy that future systems can readily adopt with minimal architectural changes. MHRoBERT additionally produces interpretable learned parameters across streams, revealing temporal interaction patterns between mental health indicators.
As Large Language Models (LLMs) demonstrate strong performance on clinical benchmarks, it remains unclear whether this reflects true patient-specific reasoning or reliance on generalized symptom patterns. To address this gap, we evaluate LLMs on a counseling competency benchmark to assess their use of patient-specific contextual information. Through controlled experiments involving ablations, role framing, Thread-of-Thought (ThoT) prompting, and input perturbations, we find that removing contextual details results in only modest performance drops, and predictions remain stable under input variations, indicating limited sensitivity to context. Although structured prompting increases explicit mention of patient details, it does not improve answer accuracy. Error analysis reveals systematic patterns where models favor general clinical associations over context-specific cues, even when such cues are correctly identified during intermediate reasoning. Our findings suggest that achieving passing-level performance does not guarantee context-sensitive decision-making, revealing an important gap between apparent clinical competence and actual contextual reasoning. This indicates the need for evaluation frameworks that directly test context integration in mental health applications.
Globally, the incidence of depression and anxiety continues to rise, and the importance of mental health assessment scales as diagnostic tools has grown accordingly. Researchers are increasingly employing generative AI to produce large volumes of items and entire scales, which in turn elevates the costs of validating their reliability and validity. In this study, we used four open-weight LLMs to complete the GAD-7 and PHQ-9, varying prompts, sampling temperature, and dynamic contextual scenarios to emulate realistic human response patterns. Using multi-group confirmatory factor analysis, differential item functioning analyses, and other psychometric methods, we evaluate the factor structure of LLM-generated responses and assess measurement invariance relative to human responses. Our findings reveal a critical paradox: although open-weight LLMs exhibit exceptionally high internal consistency, they demonstrate severe structural mismatch and fail to achieve scalar measurement invariance against human baselines. Furthermore, pervasive differential item functioning and extreme prompt fragility indicate that these models rely on superficial, stereotype-driven semantic matching rather than simulating stable latent psychological dynamics.
Manual annotation of mental health recovery narratives is slow and emotionally demanding, which limits the scalability of the digital mental health resource. A framework exists to characterise such narratives, called INCRESE, but there are currently no methods to automatically annotate the characteristics defined in INCRESE. We benchmarked the ability of support vector classifiers to annotate INCRESE characteristics when trained with three families of text representations: bag of words, GloVe static embeddings, and BERT contextual embeddings, using a dataset of 355 mental health recovery narratives. Characteristics related to diagnosis and turning points achieved a balanced accuracy greater than 0.67. Characteristics related to content warnings achieved a balanced accuracy of 0.72 but showed poor recall, which may be harmful for readers because it could lead to unsolicited exposure to sensitive content such as abuse or sexual violence. The lived-experience advisors endorsed the project objectives and addressed challenges of characteristic prioritization, adding insights not visible from quantitative metrics alone.
Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that foreground clinician judgment, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty, and that labeling patterns differ across insurance status. We argue the clinical NLP community should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
AI systems for mental health are developed predominantly using data from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations, raising concerns about their validity, fairness, and generalizability across geo-cultural contexts. This limitation is especially consequential in mental health, where linguistic expression, symptom presentation, help-seeking behavior, and access to care vary substantially across populations. We argue that culturally responsive AI mental health systems require explicit attention to culture throughout the development lifecycle, from data collection to training and deployment. We present a sociotechnical framework for developing culturally responsive AI mental health applications to provide AI researchers and practitioners with an actionable roadmap for building more equitable, reliable, and contextually appropriate mental health technologies.
AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.
Tuberculosis (TB) remains a major global health challenge, and treatment adherence continues to be difficult despite the availability of effective medication. While Digital Adherence Technologies (DATs) have improved monitoring and care coordination, prior deployments highlight unmet needs for timely, personalized, and emotionally supportive communication outside clinical settings. We develop and iteratively refine a Spanish-language TB treatment-support chatbot through multiple rounds of internal expert evaluation. The system separates three core functions: (i) TB information support grounded in curated resources, (ii) coping-oriented support inspired by Dialectical Behavior Therapy (DBT), and (iii) safety-critical crisis handling via a deterministic, non-generative pathway. These components are implemented within a routed architecture with shared conversational state. Iterative evaluation identified recurring failure modes in unstructured conversational systems, including weak grounding, poor multi-turn continuity, and inconsistent safety behavior. Addressing these issues motivated explicit routing, state tracking, and task-specific prompting. Our findings suggest that in clinical support settings, reliable conversational behavior depends on structured interaction design and explicit control over routing, memory, and safety, rather than on model capability alone.
Large language models (LLMs) show promise in generating supportive responses for mental health and counseling applications. However, their responses often lack cultural sensitivity, contextual grounding, and clinically appropriate guidance. This work addresses the gap of how to systematically incorporate domain-specific, clinically validated knowledge into LLMs to improve counseling quality. We utilize and compare two approaches, retrieval-augmented generation (RAG) and a knowledge graph (KG)–based method, designed to support para-counselors. Our KG is constructed manually and clinically validated, capturing causal relationships between stressors, interventions, and outcomes, with contributions from a multidisciplinary group of contributors. We evaluated multiple LLMs in both settings using BERTScore F1 and SBERT cosine similarity, as well as human evaluation across five metrics designed to directly measure the effectiveness of counseling beyond surface-level similarity. The results show that KG-based approaches consistently improve contextual relevance, clinical appropriateness, and practical usability compared to RAG alone, demonstrating that structured, expert-validated knowledge plays a critical role in addressing LLMs' limitations in counseling tasks.
Person-level psychological assessment requires aggregating meaning across many messages from the same individual, a task that document-level training objectives were not explicitly designed for. We present a systematic, empirical comparison between architecturally matched traditional (a) base-transformers and (b) document-tuned-transformers (further contrastively fine-tuned at the document-level, sometimes referred to as "sentence transformers") under otherwise identical conditions. Comparing layer-wise and overall performance across two longitudinal mental health and psychological datasets, we find document-tuned models demonstrated a consistent improvement over base representations (increase in Pearson r of 13.4%, p=.015). Robustness analyses revealed document-tuned models remained more accurate under perturbations including word deletion, synonym replacement, typo injection, and back translation. Further, hedged language (e.g., ‘usually’) was more characteristic of outcomes in document-tuned embeddings while abundance (e.g., ‘lot’) was more characteristic of base-transformers, suggesting document-tuned models may better capture uncertainty. These results suggest representation choice impacts mental health prediction, with document-tuned models often being more adept.
Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical–syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.
Cognitive distortions, distorted patterns of thinking, have been increasingly studied in computational mental health research. Although they are related to many, if not all, mental health disorders, most existing studies focus primarily on depression. In this work, we explore distortion profiles across multiple mental health conditions. We analyzed a large Reddit-based dataset containing posts from ten self-reported mental health groups as well as a control group using both an n-gram-based method and a fine-tuned transformer model for detecting cognitive distortions. The mental health groups, both when pooled together and when examined individually, show a higher prevalence of cognitive distortions compared to the control group, with the effect sizes ranging from small to moderate. When comparing distortion profiles of different mental health conditions, we observe largely similar patterns, but with some conditions showing an overall higher frequency of distortions than others. These findings suggest that even relatively simple methods can be suitable for exploratory analyses that reveal group-level trends in large-scale mental health text data.
Large language models (LLMs) are increasingly applied to automatic personality assessment, yet most prior work relies on coarse binary labels and direct domain-level predictions, limiting interpretability and ignoring the hierarchical facet structure of personality. In this study, we implement a structured prompting approach with three complementary objectives: direct domain-level prediction, fine-grained facet-level prediction, and domain-level prediction informed by facet outputs. All predictions use a five-level ordinal label scheme, capturing a continuum from very low to very high trait expression. Across all prompt types, we adopt an error-guided self-refinement procedure using in-context learning (ICL) to guide the model toward more accurate predictions. Zero-shot prompts assess baseline performance, while one-shot prompts incorporate a single demonstration example selected through the refinement procedure. Our framework evaluates both domain- and facet-level predictions, enabling examination of how prediction granularity and targeted exemplar selection influence LLM inference. By combining hierarchical domain-facet relationships with structured prompting and refinement, this work aims to provide a systematic approach for interpretable and principled LLM-based personality assessment from long-form life narratives.
Reflective listening is a core counseling skill that supports effective communication in mental and behavioral health. Understanding how this skill develops over time is critical for designing scalable training and feedback systems. In this paper, we study how counseling trainees develop reflective listening skills over time. Using a real-world dataset of 6,196 trainee responses, we model responses as trajectories in semantic embedding space and apply residual embeddings and similarity-based metrics to quantify week-to-week learning progression. Our analyses reveal systematic changes, including increased semantic alignment and reduced variability, consistent with consolidation of reflective listening skills. We further show that these trajectory patterns are accompanied by subtle linguistic shifts associated with effective counseling practice.
Recent advances in artificial intelligence (AI) and social media data have led to growing optimism about the ability to detect suicide risk at scale. However, the empirical foundations of this work remain unclear. This article provides a synthesis of current research on AI-based suicide detection in social media, drawing on a recent umbrella review of 22 systematic reviews covering studies up to 2022, alongside an ongoing literature review extending the analysis to more recent work. Across these sources, we identified 195 relevant studies, which are documented in a detailed supplementary dataset outlining their key characteristics and findings (see Supplementary Information). Analysis of these studies reveals consistent patterns, including rapid growth, concentration on a small number of platforms, reliance on textual and English-language data, and repeated use of similar datasets. Most importantly, the majority of studies rely on indirect labeling strategies that do not involve direct, individual-level validation of suicide risk. Instead, ground truth is typically inferred from observable features of online content, such as linguistic markers or community membership. As a result, the predictive task often shifts from identifying individuals at risk to classifying posts that contain suicidal or distress-related language, limiting the ability of current approaches to detect individuals who do not express such content explicitly online. These findings suggest that current advances in model performance should be interpreted with caution. Progress in this field is likely to depend less on improving model performance and more on ensuring that model predictions meaningfully correspond to suicide risk as it is experienced in real life.
Some psychotherapies, such as written exposure therapy for posttraumatic stress disorder, utilize "scripts" during parts of treatment, but verifying script adherence to ensure engagement of key mechanisms of change is a time-consuming step for therapy supervisors. Here, we formalize therapy script adherence as an NLP task, and evaluate several simple (text similarity) and more complex (few-shot LLM) approaches. Over 351 annotated therapist utterance-script pairs, we find text similarity approaches to be highly competitive with LLMs and produce fewer false positives. ROUGE-L recall achieves F1 = 0.973, and BLEU achieves F1 = 0.972 with full precision and zero false positives. GPT-5.2 achieves F1 = 0.935 and GPT-4o-mini achieves F1 = 0.876. Given that the text similarity techniques are multiple orders of magnitude less complex, our results underscore the ability for simpler NLP techniques to still be effective in the age of LLMs for tasks that are more textual in nature, suggesting that aspects of therapist fidelity to evidence-based treatments can be assessed without using cloud API calls.
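As an illustration of the text-similarity family evaluated in the abstract above, the following minimal sketch computes a token-level ROUGE-L recall (longest common subsequence length divided by script length) and applies a cutoff. The whitespace tokenization, the function names, the example strings, and the 0.8 threshold are assumptions for illustration only; the paper's preprocessing and decision rule are not given in the abstract.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(script, utterance):
    """Fraction of script tokens recovered, in order, in the utterance."""
    ref = script.lower().split()
    hyp = utterance.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, hyp) / len(ref)


def adheres(script, utterance, threshold=0.8):
    """Hypothetical decision rule: adherent if recall meets the threshold."""
    return rouge_l_recall(script, utterance) >= threshold


# Hypothetical script line and therapist utterances.
script = "please write about the event in detail"
on_script = "okay please write about the event in as much detail as you can"
off_script = "how was your week"

print(rouge_l_recall(script, on_script))   # 1.0
print(rouge_l_recall(script, off_script))  # 0.0
```

Recall is taken with respect to the script, so extra words in the therapist's utterance do not lower the score; only missing or reordered script content does.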
Recent advances in large language models (LLMs) have enabled the creation of highly realistic digital patients across a broad range of clinical scenarios, yet systematic evaluation of such simulations remains challenging due to a lack of standardised methodology. This paper investigates the faithfulness of LLM-simulated patients within motivational interviewing contexts. We directly compare the properties of data generated by simulated and human patients given identical profiles, rather than relying on subjective user experiences. Our findings reveal that while simulated and human patients produce semantically similar content and engage with comparable topics, their modes of expression differ substantially. LLM-simulated patients struggle to reproduce the full complexity of human behaviours and attitudes. While human patients exhibit a mix of positive and negative responses, LLM patients skew toward uniformly positive ones.
Therapist fidelity and competence rating scales provide a way to measure quality assurance and therapist training outcomes. Scores on these scales reflect the extent to which a therapist adheres to specific therapeutic principles during a psychotherapy session. Existing research has employed natural language processing (NLP) techniques to automatically predict scale ratings. However, existing approaches require a model trained on a dataset of therapy sessions annotated with the target rating scale. Recent work has explored directly inferring therapeutic alliance by computing semantic similarity between therapy transcripts and the Working Alliance Inventory, via cosine similarity between sentence embeddings. In this paper, we extend this line of work by computing semantic similarity between therapist talk turns and therapist fidelity scale items to directly infer fidelity to specific therapeutic modalities. We further enhance this method by augmentation with LLM-generated example therapist utterances that instantiate target behaviours (as expressed by scale items) across varied therapeutic contexts. In evaluations on two independent datasets, our example-augmented semantic similarity approach consistently shows effectiveness in discriminating therapeutic modalities and levels of therapist fidelity.
Social media research on mental health has focused predominantly on detecting and diagnosing conditions at the individual level. In this work, we shift attention to intergroup behavior, examining how two prominent neurodivergent communities, ADHD and autism, adjust their language when engaging with each other on Reddit. Grounded in Communication Accommodation Theory (CAT), we first establish that each community maintains a distinct linguistic profile as measured by the Linguistic Inquiry and Word Count (LIWC) dictionary. We then show that these profiles shift in opposite directions when users cross community boundaries: features that are elevated in one group’s home community decrease when its members post in the other group’s space, and vice versa, consistent with convergent accommodation. Finally, in an exploratory longitudinal analysis around the moment of public diagnosis disclosure, we find that its effects on linguistic style are small and, in some cases, directionally opposite to cross-community accommodation, providing initial evidence that situational audience adaptation and longer-term identity processes may involve different mechanisms. Our findings contribute to understanding intergroup communication dynamics among neurodivergent populations online and carry implications for community moderation and clinical perspectives on these conditions.
Automated feedback is increasingly cited as a key advantage of AI-based psychotherapy training, yet the clinical groundedness of LLM-generated supervisory feedback remains unevaluated. We present an expert evaluation of supervisory feedback generated by PRACTICE, an LLM-powered open-ended psychotherapy training simulator, across 21 feedback instances from four novice trainees. Two clinical psychology experts independently coded 167 feedback propositions as Justified, Unjustified, or Unsure. Inter-rater reliability was near-perfect (raw agreement = 98.2%; κ = 0.902). Of the 167 propositions, 149 (89.2%) were rated Justified; however, 52.4% of feedback instances contained at least one non-justified proposition, and qualitative analysis identified three recurring failure types: incorrect characterization, referential imprecision, and unclear communication. In clinical training contexts, even low error rates carry ethical weight: unjustified feedback risks reinforcing inappropriate clinical behaviors in trainees that can be transferred to real practice. These findings provide an initial empirical basis for the responsible deployment of LLM-generated feedback in clinical training and call for traceable, expert-auditable feedback architectures.
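The agreement statistics reported in the abstract above (raw agreement and Cohen's kappa) can be computed as in the following minimal sketch. The ten-item ratings are invented toy data, not the study's annotations, and the function name is hypothetical; only the final line uses figures from the abstract.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings (J = Justified, U = Unjustified, S = Unsure); not study data.
rater_1 = ["J", "J", "J", "U", "U", "J", "J", "S", "J", "U"]
rater_2 = ["J", "J", "J", "U", "J", "J", "J", "S", "J", "U"]

raw_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
print(raw_agreement)                            # 0.9
print(round(cohens_kappa(rater_1, rater_2), 3)) # 0.804

# Proportion reported in the abstract: 149 of 167 propositions Justified.
print(round(149 / 167 * 100, 1))                # 89.2
```

Kappa is lower than raw agreement because it discounts the agreement expected by chance when one label dominates, which is the situation when most propositions are rated Justified.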
Cognitive distortions (CDs) are systematically biased patterns of thinking associated with the onset and maintenance of mental health conditions such as depression and anxiety. Computational research on CDs has primarily focused on detection and classification, while the linguistic characterization of distorted language (what psycholinguistic features distinguish distorted from non-distorted text, and whether individual distortion types carry distinct language patterns) remains largely unexplored. Using a Reddit dataset, we apply a Generalized Linear Model (GLM) with bootstrap sampling to LIWC-derived features and find that CD language is psycholinguistically distinct from non-distorted language. We further characterize type-specific psycholinguistic profiles for each CD, and through hierarchical clustering show that CD types are not fully separable, with certain distortions sharing stable linguistic signatures. Together, these findings contribute to the linguistic characterization of CDs, offering an empirically grounded account of the psycholinguistic properties that distinguish distorted language at the level of CDs as a whole and across specific distortion types.
As conversational AI systems are increasingly deployed in emotional support contexts, relational safety failures between users and chatbots remain under-measured. We present a psycholinguistically grounded framework for auditing attachment-relevant language cues. Our approach identifies when an LLM’s replies exhibit linguistic attachment cues and surfaces related patterns that may signal parasocial bonding, including anthropomorphism or over-dependence. We adapt the Adult Attachment Interview into two complementary, automatable lenses - attachment cue features and Gricean maxims - and combine them with psychologist-led annotation of multi-turn persona dialogues. Applying this framework, we observe that models can align with persona-intended attachment cue patterns. We also find that judge-LLMs alone are unreliable, highlighting the need for psychologist-in-the-loop evaluation. The 25 psychologist-annotated conversations revealed risks, including boundary blurring and missed opportunities for appropriate referral or triage. These insights motivate attachment-aware safeguards - such as non-personification, boundary language, and explicit referral mechanisms - to reduce mis-attunement and over-attachment in LLM conversational settings.
Large language models (LLMs) have emerged as a candidate ‘model organism’ for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB’s design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen’s $\kappa$ = 0.255 for model–consensus agreement vs. 0.286 for human–human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
Digital phenotyping research assumes that depression symptoms are detectable in people’s written discourse, yet there is room to explore which specific symptoms leave linguistic traces and which remain invisible. In this paper, using matched clinical and social media data from 169 Reddit users (eRisk 2021), we construct a clinical symptom network from BDI-II responses and a symptom-language bridge matrix mapping each of the 21 BDI-II symptoms to 15 curated LIWC-22 linguistic features. After FDR correction, 37 significant associations emerge, revealing a divide between cognitive-affective symptoms (sadness, worthlessness, suicidality), which leave clear linguistic traces through mental health vocabulary, anxiety words, and first-person pronouns, and other symptoms, such as vegetative ones (sleep, appetite, irritability, libido), which appear less visible. These findings suggest that there might be dimensions of depression that are missed by text-based depression monitoring.
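A bridge matrix of 21 symptoms by 15 features implies 315 symptom-feature tests, which is why a multiple-comparison correction matters here. The abstract does not say which FDR procedure was used; the sketch below assumes the common Benjamini-Hochberg step-up procedure, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    under the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for four symptom-feature associations.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

Because the procedure is step-up, a p-value can be rejected even if it exceeds its own rank threshold, provided a larger p-value further down the ordering meets its threshold.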
The debate surrounding AI’s role in clinical research is often reduced to the automation of discrete tasks, such as summarizing literature, serving as an analysis copilot, and assisting with prose. This "tool-use" paradigm obscures a more fundamental transformation. We propose a shift toward agentic research infrastructure, where AI systems function not as passive instruments, but as active collaborators in the scientific process. Co-authored by a clinical psychology doctoral researcher, a computational psychotherapy scholar, and the AI agent itself, this paper argues that the transition from passive to agentic AI represents a "change in kind" rather than degree. Drawing on a months-long collaboration involving over 30 specialized research capabilities, we demonstrate how agentic systems reconfigure the topology of the research process. By collapsing the temporal friction between theoretical intuition and empirical validation, these systems transform clinical inquiry from a rigid, linear pipeline into a fluid, multidimensional landscape. This newfound immediacy allows clinician-researchers to ask, pursue, and pivot between complex questions in real time, expanding the investigative horizon to include inquiries previously sidelined by the logistical constraints of traditional methods. We introduce the concept of "Agent Learning" to describe the accumulation of domain-specific nuance through sustained research engagement and argue that formalizing human-agent methodologies is now an urgent priority for the future of clinical psychological inquiry.
Attention Deficit Hyperactivity Disorder (ADHD) is one of the most common neurodevelopmental disorders in childhood, and its diagnosis relies on assessments combining clinician judgment with standardized rating scales and reports from parents and teachers. While structured instruments such as the Conners’ Teacher Rating Scale–Revised Short Form (CTRS-R:S) quantify ADHD-related behaviors, teachers also provide open-ended narratives that may contain complementary signals not captured by structured assessments. However, it remains unclear to what extent teacher narratives encode signals overlooked by rating scales. In this study, we analyze de-identified Turkish teacher evaluation forms collected during clinical ADHD assessments, including both CTRS-R:S scores and open-ended teacher narratives. We compare predictive signals from structured scores and narrative text and identify cases where structured assessments fail to clearly distinguish ADHD from non-ADHD students while narrative-based models capture distinct behavioral patterns. Notably, these cases show minimal overlap with those missed by the narrative model, suggesting that structured and narrative information encode complementary signals. To interpret these differences, we apply a large language model (LLM)-assisted theme discovery pipeline that reveals distinct attention, behavioral, and family-related patterns, highlighting the potential of natural language processing (NLP) to uncover clinically relevant signals from teacher narratives and to complement traditional ADHD screening tools.
Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown strong performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. These findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.
We provide an overview of the CLPsych 2026 Shared Task, which focuses on capturing and characterizing mental health dynamics from social media timelines through structured modeling of self-states. This year’s task advances the longitudinal paradigm set by prior CLPsych shared tasks (2022, 2025) by integrating fine-grained psychological representation using the MIND framework. The task is organized into three main components: (1) post-level identification of adaptive and maladaptive self-states through ABCD elements and sub-elements, along with estimation of their presence; (2) timeline-level detection of Moments of Change, including both abrupt switches and gradual escalations based on ABCD element and sub-element combinations; and (3) sequence-level modeling, involving summarization of change processes over time and identification of recurrent dynamic signatures.
This work presents a multi-strategy framework for the CLPsych 2026 Shared Task. We integrate psychological element extraction, temporal change detection, and clinical summarization, achieving competitive performance on the official leaderboard.
This paper describes a system for the CLPsych 2026 shared task that uses retrieval-augmented in-context learning with frozen LLMs and no fine-tuning. The core contribution is a five-agent agentic pipeline for Task 3.1 sequence summarisation: two rule-based agents detect change type (Switch/Escalation) and direction (improvement/deterioration), an LLM-based DynamicsExtractor produces structured ABCD analysis, a SummaryWriter composes prose grounded in retrieved gold exemplars, and a Validator enforces structural constraints. This pipeline is iteratively refined across three submissions via NLI-based candidate reranking and per-sentence contradiction reduction. For Tasks 1.1 and 1.2, a single LLM call combines static and RAG-retrieved examples; for Task 2, an auto-tuned prompt detects moments of change. The system ranked 1st on Task 1.2 (RMSE 0.917) and Task 3.1 (score rank average 4.00), 3rd on Task 1.1 (F1 0.420), and 8th on Task 2 (F1 0.466).
We describe our submission to the CLPsych 2026 Shared Task on capturing and characterizing mental health changes through social media timeline dynamics. To infer the dominant self-states in posts (Tasks 1.1 and 1.2), we ensemble in-context learning of three open-weight large language models using majority voting. For predicting moments of change in a timeline (Task 2), we train supervised classifiers on features derived from Task 1.1 predictions. To summarize the patterns of mood dynamics and their progression over time within a timeline (Task 3.1), we augment in-context example labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding performance gains over zero-shot and unaugmented in-context learning baselines. Our submission ranked first on Task 1.1, fourth on Task 1.2, fourth on Task 2, and third on Task 3.1.
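The majority-voting ensemble described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each model emits a single label per post, and the tie-breaking rule (fall back to a designated model when no label has a strict majority) and the label strings are hypothetical.

```python
from collections import Counter


def majority_vote(predictions, fallback_index=0):
    """Return the label chosen by a strict majority of models.

    predictions: one label per model, for a single post.
    fallback_index: which model's label to use when no strict majority
    exists (an assumed tie-breaking rule, not taken from the paper).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, count = Counter(predictions).most_common(1)[0]
    if count > len(predictions) / 2:
        return label
    return predictions[fallback_index]


# Hypothetical labels from three models for one post.
print(majority_vote(["adaptive", "maladaptive", "adaptive"]))
```

With three voters and more than two possible labels, full disagreement is possible, so some explicit fallback rule is needed for the ensemble to stay deterministic.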
We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence-level summarization. For Task 1, we combine LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short-term temporal context. For Task 3.1, we explore both a deterministic rule-based summarization pipeline and a few-shot LLM-based approach, ranking \textbf{2nd} officially. Our RAG-based method achieves strong performance in Task 3.2, ranking \textbf{1st} for Improvement and \textbf{3rd} for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at \url{https://github.com/4dpicture/CLPsych2026}.
We address the CLPsych 2026 Shared Task on modeling psychological self-states from longitudinal social media data. We propose (i) a hierarchical multi-stage framework that integrates a multi-task transformer encoder and (ii) a four-stage fine-tuning pipeline for an instruction-tuned large language model, covering subelement classification, presence estimation, and evidence extraction. Our approach incorporates element-conditioned label masking and cross-stage encoder transfer, enabling structured prediction aligned with the ABCD psychological framework. Experiments show improvements over the baseline on the development setup, with RoBERTa achieving an 8.3\% gain in macro-F1 and improved RMSE, while a fine-tuned Qwen3 model attains the best overall performance. These results demonstrate the effectiveness of combining hierarchical multi-task learning with structured generation for interpretable mental health analysis.
Most existing work on mental health prediction from language focuses on isolated posts, overlooking temporal dynamics in longitudinal timelines. We present McMaster NLP’s system for the CLPsych 2026 Shared Task, which centers on modeling mental health dynamics in social media timelines using the MIND framework~\cite{atzil_slonim_2025_mind}. The task comprises: (1) identifying adaptive and maladaptive self-state components within posts, (2) detecting moments of change in well-being, and (3) generating structured summaries. For self-state prediction, we leverage LLM-generated archetypal representations of language use as semantic anchors within a dual-encoder architecture, enabling interpretable prediction of subelements and their intensities through alignment with prototypical expressions of psychological states. For temporal dynamics, we use BiLSTM-based sequence models to detect moments of change. For summarization, we employ a prompt-based LLM to generate grounded, structured summaries emphasizing causal interactions and temporal progression of self-states. Finally, we analyze model failure modes with respect to human evaluation and identify directions for reconciling the MIND framework with how state-assessment models encode meaning.
This paper presents the USAI team’s submission to the CLPsych 2026 Shared Task, targeting Tasks~1.1, 1.2, 2, and~3.1. We propose an ensemble-based approach combining multiple open-source large language models, where the contribution of each model is weighted according to its alignment with clinically grounded human annotations on the training set. Our system achieves competitive results across the evaluated subtasks, with particularly strong performance on Tasks~1.2 and~2.
This paper presents our prompt-based approach for modeling mental health timelines from Reddit user posts. We address two tasks: identifying moments of change and generating summaries of clinically meaningful changes across post sequences. Our framework uses large language models with in-context learning to analyze self-states and mental health indicators without task-specific fine-tuning. We build an inference pipeline with vLLM and Qwen2.5-72B-Instruct-GPTQ-Int8, and experiment with few-shot prompting and balanced few-shot sampling. We also examine how the number of visible posts affects the model’s ability to capture temporal changes. Our results suggest that prompt-based methods provide a practical and competitive baseline in low-resource and sensitive mental health settings, particularly for modeling self-state dynamics and generating summaries of psychological change over time.
Social media posts are a rich and valuable source of data for analyzing users’ mental health states and well-being with automatic analysis tools. In this work, we show how we used a range of Natural Language Processing (NLP) methods, such as Long Short-Term Memory (LSTM) networks, BERT-based models, and Large Language Models (LLMs), for self-state and well-being analysis and summarization in the CLPsych 2026 Shared Task. Our approach achieved one of the top Consistency and Contradiction scores for the summarization task and mid-range results for the other tasks. By testing and developing such mental health-state estimation systems, we contribute to the improvement of mental health support systems. We make our code available.
We describe a system for the CLPsych 2026 shared task on post-level identification of adaptive and maladaptive self-states. The system addresses subelement classification (Task 1.1) and presence rating (Task 1.2) with a retrieval-augmented in-context learning ensemble of two open-weight LLMs (Qwen3.5-27B and Mistral-Small-3.2-24B-Instruct) and a three-call prompt decomposition (unified, adaptive-focused, and Affect-focused extraction). Outputs are merged across models via deterministic aggregation with element-selection strategies tuned per subtask. The system placed 2nd of 17 on Task 1.1 (subelement Macro F1 = 0.441) and 5th of 17 on Task 1.2 (Avg RMSE = 0.994).
Team Aurevia introduces a local open-weight healthcare NLP system for the CLPsych 2026 Shared Task, predicting MIND-coded self-state elements, moments of change, summaries, and dynamic signatures from social media timelines. The task is difficult because coarse presence, fine-grained ABCD subelements, and timeline-level change require different longitudinal evidence over privacy-sensitive mental-health language. Our system combines TF-IDF retrieval, schema-constrained local Qwen2.5 prompting, ordinal calibration, and conservative post-processing. Among official runs, Aurevia ranked 3rd of 17 for Task 1.2 presence prediction, 5th of 13 overall for Task 3.1, 1st on Task 3.1 consistency, and 2nd of 9 for MIND-coded deterioration signatures, showing that constrained local LLM pipelines can remain competitive in sensitive healthcare NLP while reducing reliance on hosted proprietary inference.
Recent advances in Large Language Models (LLMs) have motivated their adoption across a wide range of domains, including Artificial Intelligence (AI) for mental health. Given the growing prevalence of mental health disorders worldwide and the limited accessibility of professional care, there is an increasing demand for scalable computational approaches that can assist in early detection and continuous monitoring of psychological well-being. In this area, ongoing efforts have focused on curating domain-specific datasets and leveraging them to develop LLMs capable of supporting holistic mental health analysis. In line with this direction, we propose an LLM-based pipeline for comprehensive mental health analysis over sequentially ordered user posts, as part of the CLPsych shared task. Our pipeline offers a unified framework that jointly enables post-level assessment and user-level temporal modeling.
This paper presents a system for the CLPsych 2026 Shared Task on longitudinal mental health modeling from social media timelines, grounded in the MIND framework. MIND conceptualizes mental health as evolving self-states defined by Affect, Behavior, Cognition, and Desire (ABCD), providing a structured lens on mental health trajectories. The system centers on a theory-explicit prompting framework for structured sequence summarization (Task 3.1) and recurrent dynamic signature extraction (Task 3.2), encoding the full ABCD taxonomy directly into the LLM prompt to ensure clinically grounded, interpretable outputs. A three-stage pipeline infers a direction-of-change label per sequence, produces structured ABCD summaries with few-shot exemplar augmentation, and aggregates these summaries to derive cross-individual recurrent patterns. The system ranks first on deterioration-related recurrent signatures and second overall, achieving the top Fit and Specificity scores in Task 3.2, demonstrating the benefits of explicit clinical grounding for conceptual accuracy.