Kevin Frome
2026
Conversational AI for Virtual Standardized Patients using a Speech-to-Speech LLM
Andrew Emerson | Keelan Evanini | Su Somay | Kevin Frome | Le An Ha | Polina Harik
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
To develop clinical reasoning skills, medical students are often tasked with interacting with trained standardized patients (SPs). Human SPs enable real conversations that can resemble authentic clinical scenarios. However, human SPs require extensive training and are often limited in their accessibility and continual availability to medical students or residents. Virtual SPs offer the ability for medical students to practice clinical interviews in a lower-stakes setting across a broader set of clinical cases. This paper introduces a virtual SP (VSP) that leverages Amazon’s Nova Sonic, a speech-to-speech foundation model designed for human-like conversation. We investigated the ability of Nova Sonic to portray four distinct clinical cases in virtual doctor-patient encounters with 20 third-year medical students. The system’s realism, its perceived learning value, and user experience were all assessed via a survey administered to the students. Students were also asked to compare this experience to interactions with a human SP. Survey results and conversations were analyzed to derive insights for improving the Nova Sonic-based VSP system.
Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts
Christopher Runyon | Peter Baldwin | Ian Micir | Kevin Frome | Stephanie Mann | Saed Rezayi | Keelan Evanini | Victoria Yaneva
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Generative AI is increasingly used to accelerate assessment content development, yet its effectiveness for generating content used in complex assessment tasks for knowledge-rich domains such as medical education is unclear. This study evaluates automated LLM-supported workflows for generating patient-centered communication assessment items that allow students to practice their communication skills. We compared two content generation approaches—constrained linear and exploratory branching—each implemented with and without anchoring in vetted multiple-choice questions (MCQs). Ten subject-matter experts (SMEs) evaluated 80 communication items across six quality dimensions using structured rubrics. The constrained linear approach yielded better ratings than exploratory branching approaches, particularly for medical accuracy and alignment with learning objectives and patient-centered behaviors. MCQ anchoring did not improve medical accuracy. Only a minority of items met all criteria without requiring revision, and no items were unanimously approved by all SMEs. These findings underscore the importance of workflow design in LLM-supported assessment content generation, the continued need for human oversight, and the current limitations of automated content generation in medical education.
2025
Automated Evaluation of Standardized Patients with LLMs
Andrew Emerson | Le An Ha | Keelan Evanini | Su Somay | Kevin Frome | Polina Harik | Victoria Yaneva
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Standardized patients (SPs) are essential for clinical reasoning assessments in medical education. This paper introduces evaluation metrics that apply to both human and simulated SP systems. The metrics are computed using two LLM-as-a-judge approaches that align with human evaluators on SP performance, enabling scalable formative clinical reasoning assessments.