Michael Heilig


2025

This study examines the classification of AI-generated clinical multiple-choice questions drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated—including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations for evaluation methods of AI-generated items in clinical test development.