Yajie He
2026
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
Zachary Ellis | Jared Joselowitz | Yash Deo | Yajie He | Anna Kalygina | Aisling Higham | Mana Rahimzadeh | Yan Jia | Ibrahim Habli | Ernest Lim
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen’s kappa of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
2023
Can Large Language Models Safely Address Patient Questions Following Cataract Surgery?
Mohita Chowdhury | Ernest Lim | Aisling Higham | Rory McKinnon | Nikoletta Ventoura | Yajie He | Nick De Pennington
Proceedings of the 5th Clinical Natural Language Processing Workshop
Recent advances in large language models (LLMs) have generated significant interest in their application across various domains including healthcare. However, there is limited data on their safety and performance in real-world scenarios. This study uses data collected by an autonomous telemedicine clinical assistant. The assistant asks symptom-based questions to elicit patient concerns and allows patients to ask questions about their postoperative recovery. We utilise real-world postoperative questions posed to the assistant by a cohort of 120 patients to examine the safety and appropriateness of responses generated by a recent popular LLM by OpenAI, ChatGPT. We demonstrate that LLMs have the potential to helpfully address routine patient queries following routine surgery. However, today’s models have important safety limitations that must be considered.