Yiyun Zhou


2024

pdf
Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions
Victoria Yaneva | Kai North | Peter Baldwin | Le An Ha | Saed Rezayi | Yiyun Zhou | Sagnik Ray Choudhury | Polina Harik | Brian Clauser
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper reports findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions. The task was organized as part of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA’24), held in conjunction with NAACL 2024, and called upon the research community to contribute solutions to the problem of modeling difficulty and response time for clinical multiple-choice questions (MCQs). A set of 667 previously used and now retired MCQs from the United States Medical Licensing Examination (USMLE®) and their corresponding difficulties and mean response times were made available for experimentation. A total of 17 teams submitted solutions and 12 teams submitted system report papers describing their approaches. This paper summarizes the findings from the shared task and analyzes the main approaches proposed by the participants.

2023

pdf
ACTA: Short-Answer Grading in High-Stakes Medical Exams
King Yiu Suen | Victoria Yaneva | Le An Ha | Janet Mee | Yiyun Zhou | Polina Harik
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper presents the ACTA system, which performs automated short-answer grading in the domain of high-stakes medical exams. The system builds upon previous work on neural similarity-based grading approaches by applying these to the medical domain and utilizing contrastive learning as a means to optimize the similarity metric. ACTA is evaluated against three strong baselines and is developed in alignment with operational needs, where low-confidence responses are flagged for human review. Learning curves are explored to understand the effects of training data on performance. The results demonstrate that ACTA leads to substantially lower number of responses being flagged for human review, while maintaining high classification accuracy.