Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment
Tazin Afrin, Le An Ha, Victoria Yaneva, Keelan Evanini, Steven Go, Kristine DeRuchie, Michael Heilig
Abstract
This study examines the classification of AI-generated clinical multiple-choice question drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated, including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations involved in evaluating AI-generated items in clinical test development.
- Anthology ID:
- 2025.aimecon-main.19
- Volume:
- Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
- Month:
- October
- Year:
- 2025
- Address:
- Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
- Editors:
- Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
- Venue:
- AIME-Con
- Publisher:
- National Council on Measurement in Education (NCME)
- Pages:
- 172–182
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.aimecon-main.19/
- Cite (ACL):
- Tazin Afrin, Le An Ha, Victoria Yaneva, Keelan Evanini, Steven Go, Kristine DeRuchie, and Michael Heilig. 2025. Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers, pages 172–182, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
- Cite (Informal):
- Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment (Afrin et al., AIME-Con 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.aimecon-main.19.pdf