Polina Harik


2024

Automated Scoring of Clinical Patient Notes: Findings From the Kaggle Competition and Their Translation into Practice
Victoria Yaneva | King Yiu Suen | Le An Ha | Janet Mee | Milton Quranda | Polina Harik
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Scoring clinical patient notes (PNs) written by medical students is a necessary but resource-intensive task in medical education. This paper describes the organization and key lessons from a Kaggle competition on automated scoring of such notes. A total of 1,471 teams took part in the competition and developed an extensive, publicly available code repository of solutions evaluated on the first public dataset for this task. The most successful approaches from this community effort are described and utilized in the development of a PN scoring system. We discuss the choice of models and system architecture with a view to operational use and scalability, and evaluate its performance on both the public Kaggle data (10 clinical cases, 43,985 PNs) and an extended internal dataset (178 clinical cases, 6,940 PNs). The results show that the system significantly outperforms a state-of-the-art existing tool for PN scoring and that task-adaptive pretraining using masked language modeling can be an effective approach even for small training samples.
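
The abstract's closing point about task-adaptive pretraining can be made concrete with a short sketch: continued masked language modeling (MLM) on unlabeled patient notes before any fine-tuning, here using the Hugging Face transformers and datasets libraries. The base model, hyperparameters, and toy notes below are illustrative assumptions and do not reflect the paper's actual configuration.

```python
# Minimal sketch of task-adaptive pretraining via masked language modeling (MLM).
# Model name, paths, and hyperparameters are assumptions, not the system's setup.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed base model
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Unlabeled patient notes (toy in-memory examples standing in for the corpus).
notes = ["Pt reports 3 days of dyspnea and chest tightness.",
         "No prior history of asthma; denies fever."]
dataset = Dataset.from_dict({"text": notes}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# The collator randomly masks tokens; the model learns to reconstruct them,
# adapting its representations to the clinical-note domain before fine-tuning.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="tapt-pn", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```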

Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions
Victoria Yaneva | Kai North | Peter Baldwin | Le An Ha | Saed Rezayi | Yiyun Zhou | Sagnik Ray Choudhury | Polina Harik | Brian Clauser
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper reports findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions. The task was organized as part of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA’24), held in conjunction with NAACL 2024, and called upon the research community to contribute solutions to the problem of modeling difficulty and response time for clinical multiple-choice questions (MCQs). A set of 667 previously used and now retired MCQs from the United States Medical Licensing Examination (USMLE®) and their corresponding difficulties and mean response times were made available for experimentation. A total of 17 teams submitted solutions and 12 teams submitted system report papers describing their approaches. This paper summarizes the findings from the shared task and analyzes the main approaches proposed by the participants.
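
As a rough illustration of the prediction task itself (not any participant's system), the sketch below fits a simple TF-IDF plus ridge-regression baseline to toy item stems for the two targets released with the task. The features, targets, and their scales are assumptions.

```python
# Illustrative baseline only: predicting item difficulty and mean response time
# from MCQ text with TF-IDF features and ridge regression. Data values are toy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

stems = ["A 45-year-old man presents with crushing chest pain ...",
         "A 3-year-old girl is brought in with a barking cough ..."]
difficulty = [0.42, 0.77]        # assumed difficulty scale
response_time = [95.0, 60.0]     # mean response time in seconds (assumed)

X = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(stems)
for name, y in [("difficulty", difficulty), ("response time", response_time)]:
    model = Ridge(alpha=1.0).fit(X, y)
    preds = model.predict(X)
    print(name, "train RMSE:", mean_squared_error(y, preds) ** 0.5)
```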

2023

ACTA: Short-Answer Grading in High-Stakes Medical Exams
King Yiu Suen | Victoria Yaneva | Le An Ha | Janet Mee | Yiyun Zhou | Polina Harik
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper presents the ACTA system, which performs automated short-answer grading in the domain of high-stakes medical exams. The system builds upon previous work on neural similarity-based grading approaches by applying these to the medical domain and utilizing contrastive learning as a means to optimize the similarity metric. ACTA is evaluated against three strong baselines and is developed in alignment with operational needs, where low-confidence responses are flagged for human review. Learning curves are explored to understand the effects of training data on performance. The results demonstrate that ACTA leads to a substantially lower number of responses being flagged for human review while maintaining high classification accuracy.
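
The review-flagging logic described above can be sketched as follows. This is not ACTA's architecture and omits its contrastive training; it simply uses an off-the-shelf sentence encoder to compare a response against rubric reference phrases and routes low-similarity cases to human review. The model name, reference phrases, and threshold are illustrative assumptions.

```python
# Hedged sketch of similarity-based grading with a human-review threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # assumed off-the-shelf encoder
references = ["shortness of breath", "difficulty breathing"]  # rubric phrases (toy)
REVIEW_THRESHOLD = 0.55                            # assumed confidence cutoff

def grade(response: str):
    """Return (label, max_similarity); flag for human review if similarity is low."""
    sims = util.cos_sim(model.encode(response, convert_to_tensor=True),
                        model.encode(references, convert_to_tensor=True))[0]
    best = float(sims.max())
    if best < REVIEW_THRESHOLD:
        return "needs human review", best
    return "concept present", best

print(grade("patient reports dyspnea on exertion"))
```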

2022

The USMLE® Step 2 Clinical Skills Patient Note Corpus
Victoria Yaneva | Janet Mee | Le An Ha | Polina Harik | Michael Jodoin | Alex Mechaber
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper presents a corpus of 43,985 clinical patient notes (PNs) written by 35,156 examinees during the high-stakes USMLE® Step 2 Clinical Skills examination. In this exam, examinees interact with standardized patients: people trained to portray simulated scenarios called clinical cases. For each encounter, an examinee writes a PN, which is then scored by physician raters using a rubric of clinical concepts, expressions of which should be present in the PN. The corpus features PNs from 10 clinical cases, as well as the clinical concepts from the case rubrics. A subset of 2,840 PNs was annotated by 10 physician experts such that all 143 concepts from the case rubrics (e.g., shortness of breath) were mapped to 34,660 PN phrases (e.g., dyspnea, difficulty breathing). The corpus is available via a data sharing agreement with NBME and can be requested at https://www.nbme.org/services/data-sharing.
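
To make the annotation layer concrete, here is a hypothetical representation of a single annotated record: a rubric concept mapped to character spans in a note. The schema and example text are invented for illustration; the released corpus's actual file format may differ.

```python
# Illustrative record structure for a concept-to-phrase annotation (assumed schema).
from dataclasses import dataclass

@dataclass
class ConceptAnnotation:
    case_id: int          # one of the 10 clinical cases
    concept: str          # rubric concept, e.g. "shortness of breath"
    note_text: str        # the examinee's patient note
    spans: list           # (start, end) character offsets of matching phrases

record = ConceptAnnotation(
    case_id=3,
    concept="shortness of breath",
    note_text="Pt c/o dyspnea and difficulty breathing for 2 days.",
    spans=[(7, 14), (19, 39)])

# Recover the annotated phrases from the offsets.
print([record.note_text[s:e] for s, e in record.spans])
```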

2020

Automated Prediction of Examinee Proficiency from Short-Answer Questions
Le An Ha | Victoria Yaneva | Polina Harik | Ravi Pandian | Amy Morales | Brian Clauser
Proceedings of the 28th International Conference on Computational Linguistics

This paper brings together approaches from the fields of NLP and psychometric measurement to address the problem of predicting examinee proficiency from responses to short-answer questions (SAQs). While previous approaches train on manually labeled data to predict the human ratings assigned to SAQ responses, the approach presented here models examinee proficiency directly and does not require manually labeled data to train on. We use data from a large medical exam where experimental SAQ items are embedded alongside 106 scored multiple-choice questions (MCQs). First, the latent trait of examinee proficiency is measured using the scored MCQs; a model is then trained with the experimental SAQ responses as input and the measured proficiency as its target variable. The predicted value is then used as a “score” for the SAQ response and evaluated in terms of its contribution to the precision of proficiency estimation.
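
The two-step idea can be sketched as follows, with heavy simplifications: proficiency is approximated here by MCQ proportion-correct (the paper measures a latent trait psychometrically), and the SAQ model is a plain TF-IDF regressor. All data values and model choices are illustrative assumptions.

```python
# Minimal sketch of the two-step approach: estimate proficiency from scored MCQs,
# then train a text model to predict that proficiency from SAQ responses.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy data: per-examinee MCQ correctness matrix and one free-text SAQ response each.
mcq_correct = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1]])  # examinees x items
saq_responses = ["start metformin and lifestyle modification",
                 "give antibiotics",
                 "metformin, dietary counseling, recheck HbA1c in 3 months"]

# Step 1: estimate proficiency from the scored MCQs (here: simple proportion-correct).
proficiency = mcq_correct.mean(axis=1)

# Step 2: train a text model to predict that proficiency from the SAQ response;
# its prediction then serves as a "score" for the SAQ, with no human labels needed.
scorer = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0)).fit(saq_responses, proficiency)
print(scorer.predict(["metformin plus diet and exercise"]))
```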