Suhas Bn


2025

This paper investigates the capacity of small language models (0.5B-5B parameters) to generate empathetic responses for individuals with PTSD. We introduce Trauma-Informed Dialogue for Empathy (TIDE), a novel dataset comprising 10,000 two-turn conversations across 500 diverse, clinically grounded PTSD personas (https://huggingface.co/datasets/yenopoya/TIDE). Using frontier-model outputs as ground truth, we evaluate eight small LLMs in zero-shot settings and after fine-tuning. Fine-tuning enhances empathetic capabilities, improving cosine similarity and perceived empathy, although gains vary across emotional scenarios and smaller models exhibit a “knowledge transfer ceiling.” As expected, Claude Sonnet 3.5 consistently outperforms the smaller models, but surprisingly, the smaller models often approach human-rated empathy levels. Demographic analyses show that older adults favor responses that validate distress before offering support (p = .004), while graduate-educated users prefer emotionally layered replies in specific scenarios. Gender-based differences are minimal (p > .15), suggesting the feasibility of broadly empathetic model designs. This work offers insights into building resource-efficient, emotionally intelligent systems for mental health support.
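To make the evaluation setup concrete, the sketch below shows one way a small model's reply could be scored against a frontier-model reference reply via embedding cosine similarity. The sentence-transformers library and the all-MiniLM-L6-v2 embedding model are illustrative assumptions, not the paper's specified tooling.

```python
# Minimal sketch: scoring a candidate reply against a frontier-model reference
# reply with embedding cosine similarity. The embedding model and library used
# here are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def empathy_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between a small-model reply and the ground-truth reply."""
    cand_vec, ref_vec = embedder.encode([candidate, reference], convert_to_tensor=True)
    return util.cos_sim(cand_vec, ref_vec).item()

# Example with placeholder strings, not dataset content.
score = empathy_similarity(
    "That sounds really frightening. It makes sense that you feel on edge.",
    "It's completely understandable to feel unsafe after what you went through.",
)
print(f"cosine similarity: {score:.3f}")
```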
Synthetic data adoption in healthcare is driven by privacy concerns, data-access limitations, and high annotation costs. We explore synthetic Prolonged Exposure (PE) therapy conversations for PTSD as a scalable alternative for training clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics such as turn-taking and treatment fidelity. We also introduce and evaluate PE-specific metrics, offering a novel framework for assessing clinical fidelity beyond surface fluency. Our findings show that while synthetic data mitigates data scarcity and protects privacy, capturing subtle therapeutic dynamics remains challenging. Synthetic dialogues replicate key linguistic features of real conversations, for instance achieving a comparable Readability Score (89.2 vs. 88.1), yet diverge on fidelity markers such as distress monitoring. This comparison underscores the need for fidelity-aware metrics that go beyond surface fluency to capture clinically significant nuances. Our model-agnostic framework gives developers and clinicians a practical tool for benchmarking generative-model fidelity before deployment in sensitive applications. These results help clarify where synthetic data can effectively complement real-world datasets, while also identifying areas for future refinement.
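As an illustration of the kind of surface comparison described above, the sketch below computes a readability score and simple turn-taking statistics for a real and a synthetic dialogue. Using textstat's Flesch Reading Ease as the "Readability Score" and a speaker-alternation turn count are assumptions for demonstration, not the paper's exact metric definitions.

```python
# Illustrative sketch: comparing a real and a synthetic dialogue on readability
# and turn-taking. Metric choices here are assumptions, not the paper's
# published definitions.
import textstat

def dialogue_metrics(turns: list[tuple[str, str]]) -> dict[str, float]:
    """Compute surface metrics for a list of (speaker, utterance) pairs."""
    text = " ".join(utterance for _, utterance in turns)
    therapist_turns = sum(1 for speaker, _ in turns if speaker == "therapist")
    return {
        "readability": textstat.flesch_reading_ease(text),
        "n_turns": len(turns),
        "therapist_turn_ratio": therapist_turns / len(turns),
    }

# Toy placeholder dialogues, not excerpts from the datasets.
real = [("therapist", "On a scale of 0 to 100, how distressed do you feel right now?"),
        ("client", "Maybe around 70. My heart is racing.")]
synthetic = [("therapist", "Let's begin the imaginal exposure. Tell me what happened."),
             ("client", "I was driving home when the other car hit us.")]

for name, dialogue in [("real", real), ("synthetic", synthetic)]:
    print(name, dialogue_metrics(dialogue))
```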