Mike Conway


2026

We present two systems for the #SMM4H-HeaRD 2026 Task 2 shared task of automated insomnia detection from clinical notes. Our systems address both subtasks: (1) binary insomnia classification and (2) multi-label rule prediction with evidence span extraction. For Subtask 1, we employ an ensemble architecture combining Qwen3-4B-Instruct and Bio_ClinicalBERT to capture both general semantic reasoning and domain-specific clinical representations. The framework utilizes chunk-based processing with overlapping token windows to handle long clinical notes efficiently. For Subtask 2, we develop a dual-head multi-task transformer model that jointly predicts insomnia labels and token-level evidence spans using BIO tagging. To improve clinical relevance, we additionally incorporate sentence-level filtering using sentence-transformer embeddings and similarity-based retrieval of insomnia-related contexts. Experimental results suggest competitive performance relative to the shared task mean and median scores across both subtasks. Our best Subtask 1 system achieves a recall of 0.9474, surpassing the shared-task mean and median recall, while our Subtask 2 system exceeds the mean and median scores across label classification, exact match, and partial match metrics. The end-to-end implementation is publicly available on GitHub.
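The chunk-based processing with overlapping token windows mentioned above can be illustrated with a minimal sketch. The function name `chunk_tokens` and the `window` and `stride` defaults are hypothetical choices for illustration; the abstract does not state the values or the exact procedure the authors used.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], window: int = 512, stride: int = 384) -> List[List[str]]:
    """Split a token sequence into overlapping windows.

    Assumed parameters (not from the paper): `window` is the maximum chunk
    length and `stride` is the step between chunk starts, so consecutive
    chunks share `window - stride` tokens.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("require window > 0 and 0 < stride <= window")
    if not tokens:
        return []
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        # Stop once the current window reaches the end of the note.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```

Each chunk would then be scored separately and the chunk-level predictions combined into a note-level label; how that aggregation is done is not specified in the abstract.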
Suicide memes are memes used to express suicide-related thoughts or comment on suicide-related issues. Suicide memes are increasingly common on social media, yet remain poorly understood and potentially harmful. There is an urgent need to better understand their characteristics and to develop appropriate content moderation strategies that limit users’ exposure to potentially harmful content. Currently, the absence of annotated datasets of suicide memes remains a key barrier to developing and evaluating automated moderation approaches. In this paper, we introduce FigSIM, the first dataset designed for fine-grained analysis of suicide memes. The dataset consists of 1049 memes, each annotated for (1) fine-grained suicide severity levels, (2) figurative phenomena (e.g. metaphors), and (3) suicide-related content (e.g. suicide method depiction). We benchmark 16 unimodal and multimodal models across three tasks: figurative language, suicide severity, and suicide-related content detection. Overall, FigSIM demonstrates that suicide memes pose unique challenges for both modeling and content moderation. Analysis revealed biases, such as underprediction of higher suicide severity levels, especially for figurative memes.
Large Language Models (LLMs) show promise in medical Question-Answering (QA) but suffer from hallucinations that jeopardize patient safety. While Retrieval-Augmented Generation (RAG) mitigates this by grounding outputs in external evidence, existing pipelines struggle with the complex, rapidly evolving nature of oncology. We present **CoMeta**, a three-level controllable metadata-aware framework optimized for Cancer Patient QA (CPQA). We introduce Clinical Hybrid Semantic-Symbolic Document Retrieval (CHSDR), which synergizes real-time Boolean search via NCBI E-Utilities with semantic retrieval to overcome metadata blindness. Additionally, we propose Semantic Enhanced Overlap Segmentation (SEOS) to prevent contextual fragmentation. Our results demonstrate that CHSDR significantly improves retrieval performance, and that CoMeta improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more reliable CPQA systems.
We address the CLPsych 2026 Shared Task on modeling psychological self-states from longitudinal social media data. We propose (i) a hierarchical multi-stage framework that integrates a multi-task transformer encoder and (ii) a four-stage instruction-tuned large language model finetuning pipeline for subelement classification, presence estimation, and evidence extraction. Our approach incorporates element-conditioned label masking and cross-stage encoder transfer, enabling structured prediction aligned with the ABCD psychological framework. Experiments show improvements over the baseline on the development setup, with RoBERTa achieving an 8.3% gain in macro-F1 and improved RMSE, while a fine-tuned Qwen3 model attains the best overall performance. These results demonstrate the effectiveness of combining hierarchical multi-task learning with structured generation for interpretable mental health analysis.
Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown strong performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. These findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.
Therapist fidelity and competence rating scales provide a way to measure quality assurance and therapist training outcomes. Scores on these scales reflect the extent to which a therapist adheres to specific therapeutic principles during a psychotherapy session. Existing research has employed natural language processing (NLP) techniques to automatically predict scale ratings. However, existing approaches require a model trained on a dataset of therapy sessions annotated with the target rating scale. Recent work has explored directly inferring therapeutic alliance by computing semantic similarity between therapy transcripts and the Working Alliance Inventory, via cosine similarity between sentence embeddings. In this paper, we extend this line of work by computing semantic similarity between therapist talk turns and therapist fidelity scale items to directly infer fidelity to specific therapeutic modalities. We further enhance this method by augmentation with LLM-generated example therapist utterances that instantiate target behaviours (as expressed by scale items) across varied therapeutic contexts. In evaluations on two independent datasets, our example-augmented semantic similarity approach consistently shows effectiveness in discriminating therapeutic modalities and levels of therapist fidelity.
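As a rough illustration of scoring a therapist talk turn against scale items by cosine similarity, the sketch below works on plain lists of floats. The helper names `cosine_similarity` and `best_item_similarity` are hypothetical, the embeddings are assumed to come from some sentence-embedding model that is not shown, and taking the maximum over items is an assumption rather than the paper's stated aggregation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_item_similarity(turn_vec: Sequence[float], item_vecs: Sequence[Sequence[float]]) -> float:
    """Highest similarity between one talk-turn embedding and any scale-item embedding.

    Assumption for illustration: each scale item (or generated example
    utterance) has already been embedded into the same vector space.
    """
    return max(cosine_similarity(turn_vec, v) for v in item_vecs)
```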

2025

Natural language processing (NLP) holds potential for analyzing psychotherapy transcripts. Nonetheless, gathering the necessary data to train NLP models for clinical tasks is a challenging process due to patient confidentiality regulations that restrict data sharing. To overcome this obstacle, we propose leveraging large language models (LLMs) to create synthetic psychotherapy dialogues that can be used to train NLP models for downstream clinical tasks. To evaluate the quality of our synthetic data, we trained three multi-task RoBERTa-based bi-encoder models, originally developed by Sharma et al., to detect empathy in dialogues. These models, initially trained on Reddit data, were developed alongside EPITOME, a framework designed to characterize empathetic communication in conversations. We collected and annotated 579 therapeutic interactions between therapists and patients using the EPITOME framework. Additionally, we generated 10,464 synthetic therapeutic dialogues using various LLMs and prompting techniques, all of which were annotated following the EPITOME framework. We conducted two experiments: one where we augmented the original dataset with synthetic data and another where we replaced the Reddit dataset with synthetic data. Our first experiment showed that incorporating synthetic data can improve the F1 score of empathy detection by up to 10%. The second experiment revealed no substantial differences between organic and synthetic data, as their performance remained on par when substituted.
Normalization of Adverse Drug Events (ADEs), or linking adverse event mentions to standardized dictionary terms, is crucial for harmonizing diverse clinical and patient-reported descriptions, enabling reliable aggregation, accurate signal detection, and effective pharmacovigilance across heterogeneous data sources. The ALTA 2025 shared task focuses on mapping extracted ADEs from documents to a standardized list of MedDRA phrases. This paper presents a system that combines rule-based methods, zero-shot and fine-tuned large language models (LLMs), along with prompt-based approaches using the latest commercial LLMs to address this task. Our final system achieves an Accuracy@1 score of 0.3494, ranking second on the shared task leaderboard.

2024

Large language models have become valuable tools for data augmentation in scenarios with limited data availability, as they can generate synthetic data resembling real-world data. However, their generative performance depends on the quality of the prompt used to instruct the model. Prompt engineering that relies on hand-crafted strategies or requires domain experts to adjust the prompt often yields suboptimal results. In this paper we present SAPE, a Spanish Adaptive Prompt Engineering method utilizing genetic algorithms for prompt generation and selection. Our evaluation of SAPE focuses on a generative task that involves the creation of Spanish therapy transcripts, a type of data that is challenging to collect due to the fact that it typically includes protected health information. Through human evaluations conducted by mental health professionals, our results show that SAPE produces Spanish counselling transcripts that more closely resemble authentic therapy transcripts compared to other prompt engineering techniques that are based on Reflexion and Chain-of-Thought.
This paper presents a method called Concept Based Description Generation, aimed at creating summaries (Brief Hospital Course and Discharge Instructions) using source (Discharge and Radiology) texts. We propose a rule-based approach for segmenting both the source and target texts. In the target text, we not only segment the content but also identify the concept of each segment based on text patterns. Our methodology involves creating a combined summarized version of each text segment, extracting important information, and then fine-tuning a Large Language Model (LLM) to generate aspects. Subsequently, we fine-tune a new LLM using a specific aspect, the combined summary, and a list of all aspects to generate detailed descriptions for each task. This approach integrates segmentation, concept identification, summarization, and language modeling to achieve accurate and informative descriptions for medical documentation tasks. Due to lack of time, we could only train on 10,000 training examples.
Adolescents exposed to advertisements promoting addictive substances exhibit a higher likelihood of subsequent substance use. The predominant source for youth exposure to such advertisements is through online content accessed via smartphones. Detecting these advertisements is crucial for establishing and maintaining a safer online environment for young people. In our study, we utilized Multimodal Large Language Models (MLLMs) to identify addictive substance advertisements in digital media. The performance of MLLMs depends on the quality of the prompt used to instruct the model. To optimize our prompts, an adaptive prompt engineering approach was implemented, leveraging a genetic algorithm to refine and enhance the prompts. To evaluate the model’s performance, we augmented the RICO dataset, consisting of Android user interface screenshots, by superimposing alcohol ads onto them. Our results indicate that the MLLM can detect advertisements promoting alcohol with a 0.94 accuracy and a 0.94 F1 score.
We explore different information extraction tools for annotation of interventions to support automated systematic reviews of preclinical AD animal studies. We compare two PICO (Population, Intervention, Comparison, and Outcome) extraction tools and two prompting-based learning strategies based on Large Language Models (LLMs). Motivated by the high recall of a dictionary-based approach, we define a two-stage method, removing false positives obtained from regexes with a pre-trained LM. With ChatGPT-based filtering using three-shot prompting, our approach reduces almost two-thirds of False Positives compared to the dictionary approach alone, while outperforming knowledge-free instructional prompting.
Identifying self-disclosed health diagnoses in social media data using regular expressions (e.g. “I’ve been diagnosed with <Disease X>”) is a well-established approach for creating ad hoc cohorts of individuals with specific health conditions. However, there is evidence to suggest that this method of identifying individuals is unreliable when creating cohorts for some mental health and neurodegenerative conditions. In the case of dementia, the focus of this paper, diagnostic disclosures are frequently whimsical or sardonic, rather than indicative of an authentic diagnosis or underlying disease state (e.g. “I forgot my keys again. I’ve got dementia!”). With this work and utilising an annotated corpus of 14,025 dementia diagnostic self-disclosure posts derived from Twitter, we leveraged LLMs to distinguish between “authentic” dementia self-disclosures and “inauthentic” self-disclosures. Specifically, we implemented a genetic algorithm that evolves prompts using various state-of-the-art prompt engineering techniques, including chain of thought, self-critique, generated knowledge, and expert prompting. Our results showed that, of the methods tested, the evolved self-critique prompt engineering method achieved the best result, with an F1-score of 0.8.
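The regular-expression approach described above, and the reason it cannot separate authentic from sardonic disclosures, can be sketched as follows. The pattern and the name `find_disclosure` are hypothetical illustrations, not the expressions used to build the corpus.

```python
import re
from typing import Optional

# Hypothetical pattern: first-person statement followed by a condition name.
# Accepts straight or curly apostrophes and stops at sentence punctuation.
DISCLOSURE = re.compile(
    r"\bI(?:['’]ve| have) (?:been diagnosed with|got) "
    r"(?P<condition>[\w' -]+?)(?=[.,!?;]|$)",
    re.IGNORECASE,
)


def find_disclosure(post: str) -> Optional[str]:
    """Return the condition named in the first matching self-disclosure, if any."""
    match = DISCLOSURE.search(post)
    return match.group("condition") if match else None
```

Both example posts quoted in the abstract match this kind of pattern, which is why a second step is needed to judge whether a matched disclosure is authentic.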
Lung cancer remains a leading cause of cancer-related deaths, but public support for individuals living with lung cancer is often constrained by stigma and misconceptions, leading to serious emotional and social consequences for those diagnosed. Understanding how this stigma manifests and affects individuals is vital for developing inclusive interventions. Online discussion forums offer a unique opportunity to examine how lung cancer stigma is expressed and experienced. This study combines qualitative analysis and unsupervised learning (topic modelling) to explore stigma-related content within an online lung cancer forum. Our findings highlight the role of online forums as a key space for addressing anti-discriminatory attitudes and sharing experiences of lung cancer stigma. We found that users both with and without lung cancer engage in discussions pertaining to supportive and welcoming topics, highlighting the online forum’s role in facilitating social and informational support.

2023

Learning from real-world clinical data has potential to promote the quality of care, improve the efficiency of healthcare systems, and support clinical research. As a large proportion of clinical information is recorded only in unstructured free-text format, applying NLP to process and understand the vast amount of clinical text generated in clinical encounters is essential. However, clinical text is known to be highly ambiguous, it contains complex professional terms requiring clinical expertise to understand and annotate, and it is written in different clinical contexts with distinct purposes. All these factors together make clinical NLP research both rewarding and challenging. In this tutorial, we will discuss the characteristics of clinical text and provide an overview of some of the tools and methods used to process it. We will also present a real-world example to show the effectiveness of different NLP methods in processing and understanding clinical text. Finally, we will discuss the strengths and limitations of large language models and their applications, evaluations, and extensions in clinical NLP.

2017

In this paper, we use qualitative research methods to investigate the attitudes of social media users towards the (opt-in) integration of social media data with routine mental health care and diagnosis. Our investigation was based on secondary analysis of a series of five focus groups with Twitter users, including three groups consisting of participants with a self-reported history of depression, and two groups consisting of participants without a self-reported history of depression. Our results indicate that, overall, research participants were enthusiastic about the possibility of using social media (in conjunction with automated Natural Language Processing algorithms) for mood tracking under the supervision of a mental health practitioner. However, for at least some participants, there was skepticism related to how well social media represents the mental health of users, and hence its usefulness in the clinical context.
Social connection and social isolation are associated with depressive symptoms, particularly in adolescents and young adults, but how these concepts are documented in clinical notes is unknown. This pilot study aimed to identify the topics relevant to social connection and isolation by analyzing 145 clinical notes from patients with a depression diagnosis. We found that providers, including physicians, nurses, social workers, and psychologists, document descriptions of both social connection and social isolation.
In this paper, we present pilot work on characterising the documentation of electronic cigarettes (e-cigarettes) in the United States Veterans Administration Electronic Health Record. The Veterans Health Administration is the largest health care system in the United States with 1,233 health care facilities nationwide, serving 8.9 million veterans per year. We identified a random sample of 2000 Veterans Administration patients, coded as current tobacco users, from 2008 to 2014. Using simple keyword matching techniques combined with qualitative analysis, we investigated the prevalence and distribution of e-cigarette terms in these clinical notes, discovering that for current smokers, 11.9% of patient records contain an e-cigarette related term.

2016

Major depressive disorder, a debilitating and burdensome disease experienced by individuals worldwide, can be defined by several depressive symptoms (e.g., anhedonia (inability to feel pleasure), depressed mood, difficulty concentrating, etc.). Individuals often discuss their experiences with depression symptoms on public social media platforms like Twitter, providing a potentially useful data source for monitoring population-level mental health risk factors. In a step towards developing an automated method to estimate the prevalence of symptoms associated with major depressive disorder over time in the United States using Twitter, we developed classifiers for discerning whether a Twitter tweet represents no evidence of depression or evidence of depression. If there was evidence of depression, we then classified whether the tweet contained a depressive symptom and if so, which of three subtypes: depressed mood, disturbed sleep, or fatigue or loss of energy. We observed that the most accurate classifiers could predict classes with high-to-moderate F1-score performances for no evidence of depression (85), evidence of depression (52), and depressive symptoms (49). We report moderate F1-scores for depressive symptoms ranging from 75 (fatigue or loss of energy) to 43 (disturbed sleep) to 35 (depressed mood). Our work demonstrates baseline approaches for automatically encoding Twitter data with granular depressive symptoms associated with major depressive disorder.

2015

2013

2010

2009