Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts

Christopher Runyon; Peter Baldwin; Ian Micir; Kevin Frome; Stephanie Mann; Saed Rezayi; Keelan Evanini; Victoria Yaneva

Abstract

Generative AI is increasingly used to accelerate assessment content development, yet its effectiveness for generating content used in complex assessment tasks for knowledge-rich domains such as medical education is unclear. This study evaluates automated LLM-supported workflows for generating patient-centered communication assessment items that allow students to practice their communication skills. We compared two content generation approaches—constrained linear and exploratory branching—each implemented with and without anchoring in vetted multiple-choice questions (MCQs). Ten subject-matter experts (SMEs) evaluated 80 communication items across six quality dimensions using structured rubrics. The constrained linear approach yielded better ratings than exploratory branching approaches, particularly for medical accuracy and alignment with learning objectives and patient-centered behaviors. MCQ anchoring did not improve medical accuracy. Only a minority of items met all criteria without requiring revision, and no items were unanimously approved by all SMEs. These findings underscore the importance of workflow design in LLM-supported assessment content generation, the continued need for human oversight, and the current limitations of automated content generation in medical education.

Anthology ID: 2026.bea-1.12
Volume: Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Ekaterina Kochmar, Bashar Alhafni, Stefano Bannò, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anais Tack, Victoria Yaneva, Zheng Yuan
Venues: BEA | WS
Publisher: Association for Computational Linguistics
Pages: 153–173
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.12/
Cite (ACL): Christopher Runyon, Peter Baldwin, Ian Micir, Kevin Frome, Stephanie Mann, Saed Rezayi, Keelan Evanini, and Victoria Yaneva. 2026. Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), pages 153–173, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts (Runyon et al., BEA 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.12.pdf
