Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment

Yiyun Zhou; Francis O’Donnell; Victoria Yaneva

Abstract

Answer explanations for medical multiple-choice questions (MCQs) are a valuable learning tool, but producing them is resource-intensive. Writing high-quality explanations requires specialized medical expertise and careful alignment with the keyed answer, distractors, and the clinical vignette. This paper evaluates whether a template-aware, retrieval-guided large language model (LLM) workflow can support this production task in a real formative assessment setting. Using a 50-item medical education self-assessment, we compared AI-generated and expert-written MCQ explanations in a blinded study involving eight medical faculty and sixteen medical students. Each participant rated 25 of 50 paired explanations on clarity, amount of information, and structure. The clearest empirical difference was in amount of information: AI-generated explanations were rated significantly higher than expert-written explanations in a cumulative link mixed model analysis (OR = 1.99, 95% CI [1.33, 2.99], p = 0.001). Ratings of clarity and structure did not differ significantly between conditions. In faculty ratings, a smaller proportion of AI-generated explanations (20%) than of expert-written explanations (38%) was judged to require correction. These findings suggest that AI can reduce first-draft authoring effort in explanation writing while still requiring expert review to ensure content accuracy.
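As a reading aid for the reported effect size, the sketch below back-computes the log-odds coefficient, its standard error, and a two-sided p-value from the odds ratio and confidence interval quoted in the abstract. It assumes a Wald-type interval that is symmetric on the log-odds scale; the abstract does not state how the interval was obtained, so this is an illustrative consistency check and not the authors' analysis. The function name `wald_from_odds_ratio` is hypothetical.

```python
import math

def wald_from_odds_ratio(or_est, ci_low, ci_high, z_crit=1.959964):
    """Recover (beta, se, z, p) from an odds ratio and its 95% CI.

    Assumption: the CI is a Wald interval, exp(beta +/- z_crit * se).
    """
    beta = math.log(or_est)                      # log-odds coefficient
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se                                # Wald z statistic
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal p-value
    return beta, se, z, p

# Values quoted in the abstract: OR = 1.99, 95% CI [1.33, 2.99]
beta, se, z, p = wald_from_odds_ratio(1.99, 1.33, 2.99)
print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.4f}")
```

Under this assumption the implied z statistic is a little above 3 and the p-value rounds to 0.001, which agrees with the figure reported in the abstract.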

Anthology ID: 2026.bea-1.31
Volume: Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Ekaterina Kochmar, Bashar Alhafni, Stefano Bannò, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anais Tack, Victoria Yaneva, Zheng Yuan
Venues: BEA | WS
Publisher: Association for Computational Linguistics
Pages: 455–462
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.31/
Cite (ACL): Yiyun Zhou, Francis O’Donnell, and Victoria Yaneva. 2026. Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), pages 455–462, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment (Zhou et al., BEA 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.31.pdf
