Xinyue Zhang

2025

pdf bib abs
Improving Barrett’s Oesophagus Surveillance Scheduling with Large Language Models: A Structured Extraction Approach
Xinyue Zhang | Agathe Zecevic | Sebastian Zeki | Angus Roberts
Proceedings of the 24th Workshop on Biomedical Language Processing

Gastroenterology (GI) cancer surveillance scheduling relies on extracting structured data from unstructured clinical texts, such as endoscopy and pathology reports. Traditional Natural Language Processing (NLP) models have been employed for this task, but recent advancements in Large Language Models (LLMs) present a new opportunity for automation without requiring extensive labeled datasets. In this study, we propose an LLM-based entity extraction and rule-based decision support framework for Barrett’s Oesophagus (BO) surveillance timing prediction. Our approach processes endoscopy and pathology reports to extract clinically relevant information and structures it into a standardised format, which is then used to determine appropriate surveillance intervals. We evaluate multiple state-of-the-art LLMs on real-world clinical datasets from two hospitals, assessing their performance in accuracy and running time cost. The results demonstrate that LLMs, particularly Phi-4 and (DeepSeek distilled) Qwen-2.5, can effectively automate the extraction of BO surveillance-related information with high accuracy, while Phi-4 is also efficient during inference. We also compared the trade-offs between LLMs and fine-tuned non-LLMs. Our findings indicate that LLM extraction based methods can support clinical decision-making by providing justifications from report extractions, reducing manual workload, and improving guideline adherence in BO surveillance scheduling.

pdf bib abs
ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs
Xinyue Zhang | Agathe Zecevic | Sebastian Zeki | Angus Roberts
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

With increased accessibility of machine-generated texts, the need for their evaluation has also grown. There are broadly two types of text generation tasks. In open-ended generation tasks (OGTs), the model generates de novo text without any input on which to base it, such as story generation. In reflective generation tasks (RGTs), the model output is generated to reflect an input sequence, such as in machine translation. There are many studies on RGT evaluation, where the metrics typically compare one or more gold-standard references to the model output. Evaluation of OGTs has received less attention and is more challenging: since the task does not aim to reflect an input, there are usually no reference texts. In this paper, we propose a new perspective that unifies OGT evaluation with RGT evaluation, based on which we develop an automatic, reference-free generative text evaluation model (ARGENT), and review previous literature from this perspective. Our experiments demonstrate the effectiveness of these methods across informal, formal, and domain-specific texts. We conduct a meta-evaluation to compare existing and proposed metrics, finding that our approach aligns more closely with human judgement.

2024

pdf bib abs
Generation and Evaluation of Synthetic Endoscopy Free-Text Reports with Differential Privacy
Agathe Zecevic | Xinyue Zhang | Sebastian Zeki | Angus Roberts
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

The development of NLP models in the healthcare sector faces important challenges due to the limited availability of patient data, mainly driven by privacy concerns. This study proposes the generation of synthetic free-text medical reports, specifically focusing on the gastroenterology domain, to address the scarcity of specialised datasets, while preserving patient privacy. We fine-tune BioGPT on over 90 000 endoscopy reports and integrate Differential Privacy (DP) into the training process. 10 000 DP-private synthetic reports are generated by this model. The generated synthetic data is evaluated through multiple dimensions: similarity to real datasets, language quality, and utility in both supervised and semi-supervised NLP tasks. Results suggest that while DP integration impacts text quality, it offers a promising balance between data utility and privacy, improving the performance of a real-world downstream task. Our study underscores the potential of synthetic data to facilitate model development in the healthcare domain without compromising patient privacy.

Co-authors

Venues

Fix author