Thomas Searle


2025

pdf bib
Named Entity Inference Attacks on Clinical LLMs: Exploring Privacy Risks and the Impact of Mitigation Strategies
Adam Sutton | Xi Bai | Kawsar Noor | Thomas Searle | Richard Dobson
Proceedings of the Sixth Workshop on Privacy in Natural Language Processing

Transformer-based Large Language Models (LLMs) have achieved remarkable success across various domains, including clinical language processing, where they enable state-of-the-art performance in numerous tasks. Like all deep learning models, LLMs are susceptible to inference attacks that exploit sensitive attributes seen during training. AnonCAT, a RoBERTa-based masked language model, has been fine-tuned to de-identify sensitive clinical textual data. The community has a responsibility to explore the privacy risks of these models. This work proposes an attack method to infer sensitive named entities used in the training of AnonCAT models. We perform three experiments; the privacy implications of generating multiple names, the impact of white-box and black-box on attack inference performance, and the privacy-enhancing effects of Differential Privacy (DP) when applied to AnonCAT. By providing real textual predictions and privacy leakage metrics, this research contributes to understanding and mitigating the potential risks associated with exposing LLMs in sensitive domains like healthcare.

2020

pdf bib
Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset
Thomas Searle | Zina Ibrahim | Richard Dobson
Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing

Clinical coding is currently a labour-intensive, error-prone, but a critical administrative process whereby hospital patient episodes are manually assigned codes by qualified staff from large, standardised taxonomic hierarchies of codes. Automating clinical coding has a long history in NLP research and has recently seen novel developments setting new benchmark results. A popular dataset used in this task is MIMIC-III, a large database of clinical free text notes and their associated codes amongst other data. We argue for the reconsideration of the validity MIMIC-III’s assigned codes, as MIMIC-III has not undergone secondary validation. This work presents an open-source, reproducible experimental methodology for assessing the validity of EHR discharge summaries. We exemplify the methodology with MIMIC-III discharge summaries and show the most frequently assigned codes in MIMIC-III are undercoded up to 35%.

2019

pdf bib
MedCATTrainer: A Biomedical Free Text Annotation Interface with Active Learning and Research Use Case Specific Customisation
Thomas Searle | Zeljko Kraljevic | Rebecca Bendayan | Daniel Bean | Richard Dobson
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

An interface for building, improving and customising a given Named Entity Recognition and Linking (NER+L) model for biomedical domain text, and the efficient collation of accurate research use case specific training data and subsequent model training. Screencast demo available here: https://www.youtube.com/watch?v=lM914DQjvSo