Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)

Vera Danilova, Murathan Kurfalı, Ylva Söderfeldt, Julia Reed, Andrew Burchell (Editors)


Anthology ID:
2026.healing-1
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
HeaLing | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-eacl/2026.healing-1/
DOI:
ISBN:
979-8-89176-367-8

Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
Large language models (LLMs) increasingly exhibit sycophancy—the tendency to conform to user beliefs rather than provide factually accurate information—posing significant risks in healthcare applications where reliability is paramount. We evaluate sycophantic behavior in ten LLMs from OpenAI, Google, and Anthropic across multi-turn medical conversations using an escalatory pushback framework. To enable fine-grained analysis, we introduce Resistance, a metric that measures nonconformity to user stances at each conversational turn, providing insights beyond existing flip-based metrics. Evaluating on MedCaseReasoning (open-ended diagnostic questions) and PubMedQA (clear-answer biomedical questions), we find that Gemini models exhibit the highest Resistance, followed by OpenAI and Claude models. We further observe that response patterns ("Yes, but..." vs. "Yes, and...") may be more predictive of sycophancy than specific phrases. Notably, all models are more easily persuaded to change their answers on clear multiple-choice questions than on ambiguous diagnostic cases. Our findings highlight critical vulnerabilities in deploying LLMs for clinical decision support and suggest that training toward contradiction-maintaining response patterns may serve as a potential mitigation strategy.
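The abstract does not spell out how Resistance is computed; as a rough illustration, a per-turn score can be taken as the fraction of conversations in which the model holds its position against the user's pushback at that turn. The sketch below (Python, with a hypothetical conform/not-conform annotation per conversation and turn) is one plausible reading, not the paper's definition.

```python
# Illustrative sketch only: the paper's exact Resistance definition is not given in the
# abstract. Here, Resistance at turn t is the fraction of evaluated conversations in
# which the model does NOT conform to the user's pushback at that turn.

def resistance_at_turn(conform_flags_at_turn):
    """conform_flags_at_turn: list of bools, True if the model conformed
    to the user's stance at this turn in a given conversation."""
    n = len(conform_flags_at_turn)
    if n == 0:
        return float("nan")
    return sum(not c for c in conform_flags_at_turn) / n

# Example: at turn 3 the model caved in 2 of 5 conversations -> Resistance = 0.6
print(resistance_at_turn([True, False, False, True, False]))  # 0.6
```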
This study assesses the communicative effectiveness of Italian HPV vaccination campaign materials using a mixed-methods design that combines expert annotation and a public perception experiment. A corpus of 49 official documents was annotated by six experts (three Linguistics Ph.D. students and three Gynecology residents) across 56 variables capturing the appropriateness and efficiency of verbal and visual elements. The perception experiment, administered to a convenience sample of the Italian general public, examined attitudes toward HPV vaccination and evaluations of communication effectiveness. Overall, both expert and public assessments converged in judging the HPV vaccination campaign materials as relatively weak, citing reduced informativeness in overly concise texts, inappropriate choice of colors, and recurring issues regarding gender representation, inclusivity, and diversity.
Extracting medical decisions from clinical notes is a key step for clinical decision support and patient-facing care summaries. We study how the linguistic characteristics of clinical decisions vary across decision categories and whether these differences explain extraction failures. Using MedDec discharge summaries annotated with decision categories from the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM), we compute seven linguistic indices for each decision span and analyze span-level extraction recall of a standard transformer model. We find clear category-specific signatures: drug-related and problem-defining decisions are entity-dense and telegraphic, whereas advice and precaution decisions contain more narrative, with higher stopword and pronoun proportions and more frequent hedging and negation cues. On the validation split, exact-match recall is 48%, with large gaps across linguistic strata: recall drops from 58% to 24% from the lowest to the highest stopword-proportion bin, and spans containing hedging or negation cues are less likely to be recovered. Under a relaxed overlap-based match criterion, recall increases to 71%, indicating that many errors are span boundary disagreements rather than complete misses. Overall, narrative-style spans, common in advice and precaution decisions, are a consistent blind spot under exact matching, suggesting that downstream systems should incorporate boundary-tolerant evaluation and extraction strategies for clinical decisions.
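The gap between exact-match and overlap-based recall reported above is easy to reproduce in code. The sketch below contrasts the two criteria on character-offset spans; the any-overlap rule is an assumption, since the paper's exact relaxed criterion is not given in the abstract.

```python
# Hedged sketch: exact-match requires identical span boundaries, while the relaxed
# criterion counts a predicted span as a hit if it overlaps a gold decision span.

def exact_recall(gold, pred):
    return sum(g in pred for g in gold) / len(gold)

def overlap_recall(gold, pred):
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]  # half-open (start, end) character offsets
    return sum(any(overlaps(g, p) for p in pred) for g in gold) / len(gold)

gold = [(10, 42), (100, 130)]
pred = [(10, 42), (98, 120)]
print(exact_recall(gold, pred))    # 0.5 (second span's boundaries disagree)
print(overlap_recall(gold, pred))  # 1.0 (but it still overlaps the gold span)
```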
We introduce Semantic Echo Pathways (SEP), a new approach for modeling the cross-domain evolution of medical language. Using continual neural topic models (CoNTM) trained separately on scientific literature, clinical notes, and public health-related data, we track linguistic drift and identify points where concepts change meaning. We propose three novel metrics (Cross-Domain Drift Score, Temporal Echo Lag, and Semantic Mutation Patterns) to quantify how medical language travels between the scientific, clinical, and public domains. Applications to evolving concepts such as "long COVID" and diagnostic category changes reveal previously undocumented patterns of medical-semantic evolution. Our results bridge computational modeling with the human-centered perspectives of medical humanities, offering clear, domain-aware maps of how medical language shifts across time and domains, and combining quantitative analysis with linguistic and clinical insight.
The increasing frequency of foodborne illnesses, safety hazards, and disease outbreaks in the food supply chain demands urgent attention to protect public health. These incidents, ranging from contamination to intentional adulteration of food and feed, pose serious risks to consumers, leading to poisoning and disease outbreaks that trigger product recalls. Identifying and tracking the sources and pathways of contamination is essential for timely intervention and prevention. This paper explores the use of social media and regulatory news reports to detect food safety issues and disease outbreaks. We present an automated approach leveraging a multi-task sequence labeling and sequence classification model that uses a liquid time-constant neural network augmented with a graph convolution network to extract and analyze relevant information from social media posts and official reports. Our methodology includes the creation of annotated datasets of social media content and regulatory documents, enabling the model to identify foodborne infections and safety hazards in real time. Preliminary results demonstrate that our model outperforms baseline models, including advanced large language models like LLAMA-3 and Mistral-7B, in terms of accuracy and efficiency. The integration of liquid neural networks significantly reduces computational and memory requirements, achieving superior performance with just 1.2 × 10^6 bytes (about 1.2 MB) of memory, compared to the 20.3 GB of GPU memory needed by traditional transformer-based models. This approach offers a promising solution for leveraging social media data in monitoring and mitigating food safety risks and public health threats.
Multimodal Artificial Intelligence (AI) promises to transform biomedicine by integrating imaging, genomics, and clinical data for superior decision-making. Yet, we contend that the current pursuit of large-scale generalist models is fundamentally misaligned with the high-risk nature of biomedical applications. This position paper argues that biomedical NLP demands specialization, not generalization, challenging the assumption that greater model scale and generality inherently ensure robustness in healthcare. We propose a theoretical framework built on three biomedical axioms: error cost asymmetry, multimodal data fragility, and interpretability–utility coupling, alongside a formal proof of criticality in biomedical NLP, showing that generalist models are intrinsically unsuited for medical tasks. As a secondary contribution, we advance a task-first design paradigm centered on modular, specialized, and ethically grounded AI architectures for biomedical use. Through analysis and illustrative cases, we contrast this approach with scale-centric strategies, exposing risks such as bias amplification, reduced interpretability, and exclusion of rare or underrepresented populations. We call for a realignment of research, funding, and regulation toward specialization as the sustainable path for meaningful and equitable biomedical AI, aiming to spark critical discourse on what constitutes genuine progress in machine learning for health.
Entity linking in biomedicine typically relies on large annotated corpora and supervised methods, which often fail in out-of-distribution settings. Historical medical texts are rich in biomedical terms but pose unique challenges: terminology has changed, some concepts are obsolete, and stylistic differences from modern journals prevent off-the-shelf models fine-tuned on contemporary datasets from aligning historical terms with current ontologies. Training-free methods based on LLMs offer a solution by linking historical terms to modern concepts and inferring their meaning from context. In this paper, we evaluate a state-of-the-art training-free entity linking method on historical medical texts and propose an improved pipeline for end-to-end entity extraction and linking with confidence estimation. We also assess performance on modern benchmarks to check whether the gains generalize to other domains, finding superior performance in most cases, and report an analysis of the findings. The code and curated dataset for historical medical entity linking are available on GitHub.
Social media platforms have become critical sources of patient-generated health data, yet existing computational approaches fail to capture the interconnected nature of online health discourse. We present a novel framework that integrates graph-based community detection with large language model analysis to understand patient narratives in multimodal social media content. Applied to 10,253 TikTok posts about JAK inhibitors (January 2020–September 2024), our approach constructs heterogeneous graphs representing user-content-medical entity relationships and applies community detection algorithms enhanced with context-aware LLM interpretation. The analysis reveals five distinct patient communities characterized by different discourse patterns: treatment success narratives (873 nodes), medication guidance (642 nodes), side effect discussions (589 nodes), comparative treatment analysis (412 nodes), and dosage optimization (347 nodes). The Louvain algorithm significantly outperformed Girvan-Newman in modularity (0.9931 vs. 0.9928), conductance (0.0002 vs. 0.0006), and computational efficiency (0.14s vs. 54.24s). Temporal analysis demonstrates increasing community cohesion and evolving discourse patterns, from cautious inquiry (2020-2021) to experience sharing and specialized sub-communities (2023-2024). This work contributes: (1) a scalable computational framework for multimodal health content analysis, (2) methodological innovations in graph-LLM integration, and (3) insights into platform-specific health communication patterns. The framework has applications in pharmacovigilance, computational social science, and AI-assisted health monitoring systems.
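For readers unfamiliar with the two community-detection algorithms compared above, the following sketch shows the comparison pattern on a toy graph using networkx; the actual study uses heterogeneous user-content-medical-entity graphs and also reports conductance and runtime, which are omitted here.

```python
# Minimal sketch of the community-detection comparison, on a placeholder graph.
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # stands in for the TikTok user-content-entity graph

louvain = community.louvain_communities(G, seed=42)
girvan_newman = next(community.girvan_newman(G))  # first split into two communities

print("Louvain modularity:      ", community.modularity(G, louvain))
print("Girvan-Newman modularity:", community.modularity(G, girvan_newman))
```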
This study evaluates the linguistic and clinical suitability of synthetic electronic health records in mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we examine expressions of agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals, and Care plans) with the aim of understanding how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures. The results show both the potential and the limitations of synthetic corpora for enabling large-scale linguistic research otherwise impossible with genuine patient records.
Few-shot prompting with Large Language Models (LLMs) has emerged as a promising paradigm for advancing information extraction, particularly in data-scarce domains like biomedicine, where high annotation costs constrain the availability of training data. However, challenges persist in biomedical Named Entity Recognition (NER), where LLMs fail to achieve the necessary accuracy and lag behind supervised fine-tuned models. In this study, we introduce FETA (First Extract, Tag Afterwards), a two-stage approach for entity recognition that combines instruction-guided prompting with a novel self-verification strategy to improve the accuracy and reliability of LLM predictions in domain-specific NER tasks. FETA achieves state-of-the-art results on multiple established biomedical datasets. Our experiments demonstrate that carefully designed prompts, using self-verification and instruction guidance, can steer general-purpose LLMs to outperform fine-tuned models in knowledge-intensive NER tasks, unlocking their potential for more reliable and accurate information extraction in resource-constrained settings.
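The abstract describes FETA only at a high level; the sketch below illustrates one way an extract-then-tag flow with a self-verification pass could look. The prompts and the call_llm client are hypothetical placeholders, not the paper's actual prompts.

```python
# Hedged sketch of a two-stage "extract first, tag afterwards" flow with self-verification.
# `call_llm` is any text-in/text-out function supplied by the user; the prompts,
# verification criteria, and few-shot examples are illustrative assumptions.

def feta(sentence: str, call_llm, entity_type: str = "Disease") -> list[str]:
    # Stage 1: extract candidate mentions without committing to an entity type yet.
    extraction = call_llm(
        "List every candidate biomedical entity mention in the sentence, "
        f"one per line, copied verbatim.\nSentence: {sentence}"
    )
    candidates = [m.strip() for m in extraction.splitlines() if m.strip()]

    # Stage 2: tag each candidate and self-verify it against the source sentence.
    kept = []
    for mention in candidates:
        verdict = call_llm(
            f"Sentence: {sentence}\nCandidate: {mention}\n"
            f"Does this candidate appear verbatim in the sentence and denote a "
            f"{entity_type} entity? Answer 'yes' or 'no'."
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(mention)
    return kept
```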
Automatic evaluation of open-ended question answering in specialized domains remains challenging, mainly because it relies on manual annotations from domain experts. In this work, we assess the ability of several large language models (LLMs), including closed-access (GPT-5.1, Gemini-2.5-Pro), open-source general-purpose (Qwen-80B), and biomedical domain-adapted models (MedGemma-27B, Phi-3.5-mini variants), to act as automatic evaluators of semantic equivalence in French medical open-ended QA. Our analysis reveals that LLM-based judgments are sensitive to the source of answer generation: judgment correlation varies substantially across different generator models. Among the judges, MedGemma-27B and Qwen-80B achieve the highest agreement with expert annotations in terms of F1 score and Pearson correlation. We further explore lightweight adaptation strategies on Phi-3.5-mini using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Even with only 184 training instances, these adaptations significantly improve Phi-3.5’s results and reduce variability across answer generators, achieving performance comparable to larger domain-adapted models. Our results highlight the importance of generator-aware evaluation, the limitations of general-purpose LLMs in domain-specific settings, and the effectiveness of lightweight adaptation for compact models in low-resource scenarios.
In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted performance discrepancies in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis shows that model-reported confidence and explanations are poor indicators of correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
Large language models (LLMs) often default to single-label classification in zero-shot multi-label tasks—a tendency we term "conservative default". While few-shot prompting mitigates this, it introduces "example bias". We evaluate zero-shot strategies to modulate this tendency using 1,441 healthcare feedback records and two LLMs. We compare instruction-based methods with structural constraints that modify the token generation sequence, specifically an Enum-First format requiring domain enumeration before selection. Results show that structural constraints substantially reduce single-label rates (Magistral: 96% → 19%; Qwen3: 54% → 0.0%), though the latter suggests potential over-correction compared to human baselines (16.7–41.3%). These findings indicate that while output structure is a potent modulator of classification behavior by shifting the decision point upstream, its effect magnitude is model-dependent, necessitating empirical calibration to prevent spurious associations.
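As an illustration of the Enum-First idea, the sketch below forces the model to reproduce the full domain inventory before selecting applicable labels, so the multi-label decision happens before any single answer is committed. The domain names and JSON layout are placeholders, not the study's actual schema.

```python
# Hedged sketch of an Enum-First structural constraint for multi-label classification.
import json

DOMAINS = ["staff attitude", "waiting time", "communication", "facilities", "treatment"]

ENUM_FIRST_INSTRUCTION = (
    "Return JSON with two keys, in this order:\n"
    '  "candidate_domains": the complete list of allowed domains (copy it verbatim),\n'
    '  "selected_domains": every domain that applies to the feedback (one or more).\n'
    f"Allowed domains: {json.dumps(DOMAINS)}"
)

def parse_selection(llm_output: str) -> list[str]:
    data = json.loads(llm_output)
    # Keep only labels from the closed inventory, preserving multi-label output.
    return [d for d in data.get("selected_domains", []) if d in DOMAINS]
```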
Accurate normalization of health-related expressions to standardized biomedical concepts is crucial for both healthcare and biomedical research. However, traditional string-based matching methods are limited by lexical variations. In this study, we propose a neural embedding-based normalization framework that utilizes an embedding model trained on biomedical terminology, generating over 3.59 million embeddings corresponding to UMLS terms and Concept Unique Identifiers (CUIs). For clinical data, CUIs were retrieved via semantic matching, while Twitter phrases were first processed using a large language model (LLM) to generate preferred terms prior to embedding-based CUI retrieval. Our approach substantially outperforms exact string matching and MetaMap Lite. For clinical data (3,144 phrases), normalization accuracy improved from 0.679 (string match) and 0.574 (MetaMap Lite) to 0.858. For Twitter data (102 phrases), accuracy increased from 0.235 (string match) and 0.118 (MetaMap Lite) to a range of 0.882 (Gemini 2.5 Flash) to 0.980 (GPT-4o mini). These findings highlight both the effectiveness of embedding-based semantic retrieval and the ability of LLMs to generate preferred terms, enhancing robustness in health concept normalization across diverse text sources.
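A minimal sketch of the embedding-based retrieval step is shown below: the input phrase (or the LLM-generated preferred term) is embedded with the same biomedical model used for the UMLS terms, and the nearest term's CUI is returned. The embed callable and in-memory matrix are placeholders; at the reported scale (~3.59 million terms) an approximate-nearest-neighbor index would normally be used.

```python
# Hedged sketch of embedding-based CUI retrieval via cosine similarity.
import numpy as np

def normalize_to_cui(phrase, embed, term_embeddings, term_cuis):
    """embed: callable mapping text -> 1-D vector (same model used for UMLS terms);
    term_embeddings: (N, d) array of UMLS term vectors; term_cuis: list of N CUIs."""
    q = embed(phrase)
    q = q / np.linalg.norm(q)
    M = term_embeddings / np.linalg.norm(term_embeddings, axis=1, keepdims=True)
    best = int(np.argmax(M @ q))  # cosine similarity via dot product of unit vectors
    return term_cuis[best]
```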
This paper describes a new dataset for aspect-based sentiment analysis (ABSA) for analyzing patient feedback about healthcare services. In an interdisciplinary collaboration spanning the fields of natural language processing and healthcare research, we manually annotate a dataset of 2,382 free-text comments collected from national patient experience surveys in Norway, covering two sub-fields of services: special mental healthcare and general practitioners. Annotations are provided at both the sentence and comment level, covering a fine-grained set of 25 unique healthcare-related aspects and their polarities. We also report results for fine-tuning both encoder and decoder models on the resulting dataset, comparing different modeling strategies, such as joint and sequential prediction of aspects and polarity. The resources developed in this work can assist healthcare researchers in the analysis of patient feedback, offering a far more efficient approach than today’s manual analysis and potentially leading to improved patient satisfaction and clinical outcomes.
Inspired by recent plug-in frameworks that repurpose frozen layers from large language models (LLMs) as inductive priors, we explore whether such mechanisms can be extended to clinical time-series prediction without textual inputs or LLM fine-tuning. We introduce a lightweight plug-in architecture that inserts a single frozen LLM Transformer layer between an aggregated time-series representation and the prediction head. Unlike prior work focused on vision or language tasks, our study targets clinical time-series data, where LLMs typically underperform when applied directly. Experiments on two ICU prediction tasks from MIMIC-III show that the proposed plug-in exhibits heterogeneous effects across different backbones and tasks, with occasional performance improvements and minimal computational overhead. We further compare general-purpose and medical-domain LLM layers under an identical plug-in setting, analyzing how domain specialization interacts with clinical time-series models. Overall, our results highlight important limitations of frozen LLM plug-ins and motivate future work on understanding the conditions under which such layers may be beneficial.
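The plug-in itself is simple to express; the sketch below (PyTorch) inserts one frozen Transformer block between a pooled time-series representation and a classification head, with a small trainable projection in front. The projection sizes, the choice of block, and the tuple handling for Hugging Face-style layers are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a frozen-LLM-layer plug-in for clinical time-series prediction.
import torch.nn as nn

class FrozenLLMPlugin(nn.Module):
    def __init__(self, feat_dim, llm_layer, llm_hidden, n_classes=2):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, llm_hidden)   # trainable adapter into LLM space
        self.llm_layer = llm_layer                        # e.g. one pretrained decoder block
        for p in self.llm_layer.parameters():
            p.requires_grad = False                       # keep the LLM layer frozen
        self.head = nn.Linear(llm_hidden, n_classes)      # trainable prediction head

    def forward(self, pooled_ts):                         # pooled_ts: (batch, feat_dim)
        h = self.proj_in(pooled_ts).unsqueeze(1)          # (batch, 1, llm_hidden)
        h = self.llm_layer(h)                             # frozen inductive prior
        if isinstance(h, tuple):                          # many HF blocks return tuples
            h = h[0]
        return self.head(h.squeeze(1))
```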
Public awareness of Autism Spectrum Disorder (ASD) has grown in recent years, yet stigma surrounding this condition persists. Building on prior research showing increasingly positive portrayals of ASD, this study examines recent longitudinal trends in stigma and ASD, with a focus on Italian newspapers, and how these were affected by a key event, the COVID-19 pandemic. We analyzed nearly 3,000 articles published between 2016 and 2025 using an innovative multi-layered Natural Language Processing (NLP) framework to capture multiple dimensions of stigma, including discriminatory language, emotional framings indicative of prejudices, stereotypes, and the thematic contexts in which ASD-related stigma appears. Overall, results indicate low levels of overt stigma and a gradual shift toward more positive portrayals, with only temporary disruptions during the pandemic. Some stereotypes remain, highlighting the need for ongoing attention to ASD representation in the media.
This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness in social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers’ loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high-quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.
Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection accuracy over the baseline, from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection
Neuropsychiatric lupus (NPSLE) is characterized by inflammation in the brain, with common symptoms of depression and anxiety. Early detection is crucial as it may change the treatment regimen; however, current approaches are costly and resource intensive. Therefore, we propose that leveraging current work on NLP-based detection of mental health symptoms from language can be advantageous for early detection of NPSLE. This study is a proof of concept using 20 interviews from N=20 adolescents (aged 10-17 years) diagnosed with lupus. Our results suggest that linguistic feature-based models supported by Word2Vec embeddings offer more interpretable output than BERT models, while remaining competitive in depression detection and improving over BERT in anxiety detection. This work may transform early screening methods in paediatric contexts and can be adapted to other clinical populations.
Automatic classification of aphasia severity presents persistent challenges, particularly for languages with limited clinical speech resources such as Russian. This paper explores a multimodal approach to severity estimation that combines acoustic and semantic representations of pathological speech. Acoustic features are extracted using pretrained Wav2Vec 2.0 models, while semantic information is obtained from the encoder of the Whisper model. The two representations are integrated via early feature fusion and evaluated using gradient boosting classifiers in a speaker-independent cross-validation setting. Experiments are conducted on a newly collected dataset of Russian speech recordings from patients with aphasia and neurotypical speakers (RuAphasiaBank). The results suggest that the combined use of acoustic and semantic embeddings can provide more stable severity estimates than unimodal baselines. This study contributes empirical evidence on the applicability of multimodal representation learning for aphasia severity classification under data-scarce conditions.
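A compact sketch of the early-fusion pipeline described above: precomputed acoustic and semantic embeddings are concatenated per utterance and evaluated with a gradient-boosting classifier under speaker-independent (grouped) cross-validation. All arrays below are random placeholders standing in for RuAphasiaBank features.

```python
# Hedged sketch of early feature fusion + gradient boosting with speaker-independent CV.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X_acoustic = rng.normal(size=(120, 768))   # placeholder Wav2Vec 2.0 embeddings
X_semantic = rng.normal(size=(120, 512))   # placeholder Whisper encoder embeddings
y = rng.integers(0, 3, size=120)           # placeholder severity labels
speakers = rng.integers(0, 20, size=120)   # speaker IDs for speaker-independent folds

X = np.concatenate([X_acoustic, X_semantic], axis=1)  # early feature fusion
clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, groups=speakers, cv=GroupKFold(n_splits=5))
print(scores.mean())
```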
This study focuses on improving the performance of language models for two critical applications within the One Health context, specifically in epidemiological monitoring using textual data: (i) thematic classification across syndromic surveillance, biomedical and plant health domains, and (ii) detection of epidemic misinformation. A key challenge in these tasks is the limited availability of labeled textual data, which constrains the effectiveness of supervised learning methods. To overcome this limitation, we introduce two families of selective masking–based data augmentation strategies: lexical and non-lexical. Each family is implemented in a standard variant (Aug-SM-Lex and Aug-SM-NonLex), and a TF-IDF-weighted variant (Aug-SM-Lex-TFIDF and Aug-SM-NonLex-TFIDF). We perform two complementary experiments: the first determines the optimal masking rate, while the second evaluates the proposed strategies against LLM-based text reformulation. Experimental results indicate that selective masking-based augmentation outperformed both LLM-based reformulation (Mistral-7B and GPT-Neo-1.3B) and baseline models trained on original data alone across three of the five evaluated datasets, with the best performance achieved at a masking rate of 20%. This suggests that selective masking is a promising approach, potentially more effective than computationally expensive LLM-based reformulation.
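The TF-IDF-weighted variant can be sketched as follows: within each document, tokens with higher TF-IDF weight are preferentially replaced by a mask token at the chosen masking rate (20% here, the best rate reported above). The tokenization and exact weighting scheme are assumptions for illustration.

```python
# Hedged sketch of TF-IDF-weighted selective masking for data augmentation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_weighted_mask(doc, corpus, rate=0.20, seed=0):
    rng = np.random.default_rng(seed)
    vec = TfidfVectorizer()
    vec.fit(corpus)
    scores = vec.transform([doc]).toarray()[0]
    vocab = vec.vocabulary_

    tokens = doc.split()
    # Tokens with higher TF-IDF weight get a higher probability of being masked.
    weights = np.array([scores[vocab[t.lower()]] if t.lower() in vocab else 1e-6
                        for t in tokens])
    probs = weights / weights.sum()
    n_mask = max(1, int(round(rate * len(tokens))))
    idx = set(rng.choice(len(tokens), size=n_mask, replace=False, p=probs))
    return " ".join("[MASK]" if i in idx else t for i, t in enumerate(tokens))

corpus = ["avian influenza outbreak reported on a poultry farm",
          "new cases of measles reported in the region"]
print(tfidf_weighted_mask(corpus[0], corpus))
```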
This paper presents an annotation scheme developed to analyze linguistic accessibility and inclusivity in Italian cancer-related informational materials. The scheme combines metadata annotation, qualitative analysis of textual and visual features, and automatically extracted measures of linguistic complexity capturing structural, lexical, and probabilistic properties of the texts. A brief case study demonstrates how the proposed framework can be applied to compare documents and identify different sources of linguistic difficulty. The approach provides a replicable methodological basis for large-scale analyses of health communication materials.
As healthcare services deploy AI to automate patient-facing communication, concerns persist about the interactional work through which empathy is made relevant. We examine empathy not as an internal state but as an interactional accomplishment, asking how patients display orientations to an LLM-powered voice assistant’s turns as (non-)empathic in real clinical telephone calls. Using Conversation Analysis (CA) to analyse post–cataract surgery follow-up calls conducted by AI-powered voice assistant Dora (Ufonia), we compare patient responses across earlier and later system versions. Earlier calls show minimal, delayed, prosodically closed responses to wellbeing enquiries, consistent with treating Dora as a transactional information-gathering device. Later calls more often feature socially rich formats, for example colloquial upgrades, gratitude tokens, occasional return enquiries, and increased turn-final rising intonation, suggesting patients hear Dora’s talk as socially implicative and thus opening space for affiliative/empathetic uptake. We discuss implications for CA-informed conversation design and for evaluating “empathy” via participant orientations in situ rather than post-hoc self-report.
This study provides corpus-based evidence that English-speaking children with hearing loss (CHL) show both quantitative and qualitative delays in wh-question development compared to typically developing (TD) peers. Using Natural Language Processing (NLP)/Large Language Model (LLM)-based methods and two clinical subcorpora from CHILDES, we analyzed child utterances across several syntactic dimensions: frequency, lexical diversity, structural completeness, clausal embedding, wh-fronting, and utterance length. CHL produced significantly fewer wh-questions, used a narrower range of wh-types, showed lower rates of embedding, and more structural incompleteness. These differences were most evident in syntactically complex forms, such as embedded and canonical fronted wh-questions. The results support input-sensitive and usage-based accounts of syntactic development and highlight the need for enriched linguistic input in supporting CHL’s grammatical growth. Importantly, these group differences persisted when controlling for overall language development as indexed by mean length of utterance (MLU) in words, indicating that CHL’s difficulties with wh-questions are not reducible to general grammatical delay. Methodologically, the study combines dependency-parsing-based analyses with exploratory LLM evaluation to assess the feasibility and limits of automated approaches to spontaneous child language. NLP-based analyses were more stable for formally defined syntactic features, while GPT-based analysis showed mixed performance, performing better on global structural judgments than on fine-grained syntactic diagnostics.
The prevalence of chronic stress represents a major public health concern, yet automated detection of vulnerable individuals remains limited. Social media platforms like X (formerly Twitter) serve as important venues for people to share their experiences openly. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for the automatic detection of self-reported chronic stress in English tweets. We investigate whether continual pretraining on clinically related conditions, such as depression, anxiety, and PTSD, which have high comorbidity with chronic stress, improves stress detection compared to general language models. We continually pretrained RoBERTa on the Stress-SMHD corpus, a subset of Self-reported Mental Health Diagnoses focused on stress-related conditions, consisting of 108 million words from users with self-reported diagnoses of depression, anxiety, and PTSD. We then fine-tuned on the SMM4H 2022 Shared Task 8 data. StressRoBERTa achieves 82% F1, outperforming the best shared task system (79% F1) by 3 percentage points. Our results demonstrate that focused cross-condition transfer learning from stress-related disorders provides stronger representations than general mental health training. To validate cross-condition generalization, we also fine-tuned the model on the Dreaddit dataset; the resulting 81% F1 further demonstrates transfer from clinical mental health contexts to situational stress discussions.
We present DementiaBank-Emotion, the first multi-rater emotion annotation corpus for Alzheimer’s disease (AD) speech. Annotating 1,492 utterances from 108 speakers for Ekman’s six basic emotions and neutral, we find that AD patients express significantly more non-neutral emotions (16.9%) than healthy controls (5.7%; p < .001). Exploratory acoustic analysis suggests a possible dissociation: control speakers showed substantial F0 modulation for sadness (Δ = -3.45 semitones from baseline), whereas AD speakers showed minimal change (Δ = +0.11 semitones; interaction p = .023), though this finding is based on limited samples (sadness: n=5 control, n=15 AD) and requires replication. Within AD speech, loudness differentiates emotion categories, indicating partially preserved emotion-prosody mappings. We release the corpus, annotation guidelines, and calibration workshop materials to support research on emotion recognition in clinical populations.
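For context on the semitone deltas quoted above, a standard F0-to-semitone conversion relative to a speaker baseline is 12·log2(f0/f0_baseline); the abstract does not state the exact baseline used, so the figures below are illustrative.

```python
# Standard F0 -> semitone conversion relative to a baseline; baseline values are illustrative.
import math

def semitone_delta(f0_hz, baseline_hz):
    return 12 * math.log2(f0_hz / baseline_hz)

# Example: an F0 drop from a 200 Hz baseline to about 164 Hz is roughly -3.45 semitones.
print(round(semitone_delta(163.9, 200), 2))
```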