Saed Rezayi
2026
Policy-Sensitive Fairness Evaluation in Automated Scoring of Clinical Communication
Saed Rezayi | Le An Ha | Victoria Yaneva | Polina Harik | Janet Mee | Jason Snyder
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
This study examines automated scoring fairness in a formative assessment context: the automated evaluation of medical students’ communication skills. Building on the premise that definitions of fairness are value-dependent, we investigate how conclusions about group differences may vary under different weighting schemes for false positives (FPs) and false negatives (FNs). Results show that when errors are treated symmetrically, no statistically significant differences are observed across demographic groups based on race or gender. This pattern remains stable when error weights are varied, with no consistent or robust disparities emerging. A small number of isolated differences appear under moderate FN weighting. Overall, the findings suggest that fairness conclusions in this setting are relatively robust to variations in error weighting. At the same time, the study highlights the importance of making value assumptions explicit when evaluating automated scoring systems, particularly in formative contexts where error trade-offs carry pedagogical implications for feedback, learner engagement, and educational equity.
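The idea of weighting false positives and false negatives differently can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy records, and the weights are all hypothetical, and the max-minus-min gap is only one simple way to summarize a group difference.

```python
from collections import defaultdict


def weighted_error_rate(y_true, y_pred, fp_weight=1.0, fn_weight=1.0):
    """Weighted errors per response; the weights are illustrative assumptions."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (fp_weight * fp + fn_weight * fn) / len(y_true)


def group_gap(records, fp_weight=1.0, fn_weight=1.0):
    """Largest difference in weighted error rate between any two groups.

    records: iterable of (group, true_label, predicted_label) tuples.
    """
    by_group = defaultdict(lambda: ([], []))
    for group, t, p in records:
        by_group[group][0].append(t)
        by_group[group][1].append(p)
    rates = [
        weighted_error_rate(t, p, fp_weight, fn_weight)
        for t, p in by_group.values()
    ]
    return max(rates) - min(rates)


# Hypothetical toy data: group A has one false negative, group B one false positive.
records = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 1), ("B", 0, 1), ("B", 0, 0),
]

symmetric_gap = group_gap(records)                    # FP and FN cost the same
fn_heavy_gap = group_gap(records, fn_weight=2.0)      # FN costs twice as much
```

In this toy case the groups look identical under symmetric weights and diverge once false negatives are weighted more heavily, which is the kind of sensitivity the abstract describes testing. A real analysis would also need a significance test, which this sketch omits.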
Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts
Christopher Runyon | Peter Baldwin | Ian Micir | Kevin Frome | Stephanie Mann | Saed Rezayi | Keelan Evanini | Victoria Yaneva
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Generative AI is increasingly used to accelerate assessment content development, yet its effectiveness for generating content used in complex assessment tasks for knowledge-rich domains such as medical education is unclear. This study evaluates automated LLM-supported workflows for generating patient-centered communication assessment items that allow students to practice their communication skills. We compared two content generation approaches—constrained linear and exploratory branching—each implemented with and without anchoring in vetted multiple-choice questions (MCQs). Ten subject-matter experts (SMEs) evaluated 80 communication items across six quality dimensions using structured rubrics. The constrained linear approach yielded better ratings than exploratory branching approaches, particularly for medical accuracy and alignment with learning objectives and patient-centered behaviors. MCQ anchoring did not improve medical accuracy. Only a minority of items met all criteria without requiring revision, and no items were unanimously approved by all SMEs. These findings underscore the importance of workflow design in LLM-supported assessment content generation, the continued need for human oversight, and the current limitations of automated content generation in medical education.
2025
Automated Scoring of Communication Skills in Physician-Patient Interaction: Balancing Performance and Scalability
Saed Rezayi | Le An Ha | Yiyun Zhou | Andrew Houriet | Angelo D’Addario | Peter Baldwin | Polina Harik | Ann King | Victoria Yaneva
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This paper presents an automated scoring approach for a formative assessment tool aimed at helping learner physicians enhance their communication skills through simulated patient interactions. The system evaluates transcribed learner responses by detecting key communicative behaviors, such as acknowledgment, empathy, and clarity. Built on an adapted version of the ACTA scoring framework, the model achieves a mean binary F1 score of 0.94 across 8 clinical scenarios. A central contribution of this work is the investigation of how to balance scoring accuracy with scalability. We demonstrate that synthetic training data offers a promising path toward reducing reliance on large, annotated datasets—making automated scoring more accurate and scalable.
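For readers unfamiliar with the metric, a mean binary F1 across scenarios can be computed as in the sketch below. It assumes that F1 is calculated per scenario on the positive class and then averaged with equal weight; the abstract does not state the exact aggregation, and the scenario names and labels here are hypothetical.

```python
def binary_f1(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 if there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f1(scenarios):
    """Unweighted mean of per-scenario F1; scenarios maps name -> (y_true, y_pred)."""
    scores = [binary_f1(t, p) for t, p in scenarios.values()]
    return sum(scores) / len(scores)


# Hypothetical labels for two scenarios (1 = behavior present, 0 = absent).
scenarios = {
    "scenario_a": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "scenario_b": ([1, 0], [1, 0]),
}
average = mean_f1(scenarios)
```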
Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems
Sahar Yarmohammadtoosky | Yiyun Zhou | Victoria Yaneva | Peter Baldwin | Saed Rezayi | Brian Clauser | Polina Harik
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This study examines vulnerabilities in transformer-based automated short-answer grading systems used in medical education, with a focus on how these systems can be manipulated through adversarial gaming strategies. Our research identifies three main types of gaming strategies that exploit the system’s weaknesses, potentially leading to false positives. To counteract these vulnerabilities, we implement several adversarial training methods designed to enhance the system’s robustness. Our results indicate that these methods significantly reduce the susceptibility of grading systems to such manipulations, especially when combined with ensemble techniques like majority voting and Ridge regression, which further improve the system’s defense against sophisticated adversarial inputs. Additionally, employing large language models such as GPT-4 with varied prompting techniques has shown promise in recognizing and scoring gaming strategies effectively. The findings underscore the importance of continuous improvements in AI-driven educational tools to ensure their reliability and fairness in high-stakes settings.
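As a rough illustration of the majority-voting ensemble mentioned above, the sketch below combines binary decisions from several graders. It is an assumption-laden toy, not the paper's system: the function name and the votes are hypothetical, and resolving ties to 0 (incorrect) is a choice made here because the abstract's concern is false positives.

```python
def majority_vote(predictions):
    """Combine per-model binary labels item by item.

    predictions: list of equal-length lists, one per model, with 0/1 labels.
    An item is marked 1 only if strictly more than half of the models say 1,
    so ties fall back to 0 (an assumption of this sketch).
    """
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


# Hypothetical votes from three graders on three responses.
model_votes = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
ensemble = majority_vote(model_votes)
```

A single grader fooled by a gaming response is outvoted as long as most of the others are not, which is the intuition behind using an ensemble as a defense.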
Towards Reliable Generation of Clinical Chart Items: A Counterfactual Reasoning Approach with Large Language Models
Jiaxuan Li | Saed Rezayi | Peter Baldwin | Polina Harik | Victoria Yaneva
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
This study explores GPT-4 for generating clinical chart items in medical education using three prompting strategies. Expert evaluations found many items usable or promising. The counterfactual approach enhanced novelty, and item quality improved with high-surprisal examples. This is the first investigation of LLMs for automated clinical chart item generation.
2024
Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions
Victoria Yaneva | Kai North | Peter Baldwin | Le An Ha | Saed Rezayi | Yiyun Zhou | Sagnik Ray Choudhury | Polina Harik | Brian Clauser
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
This paper reports findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions. The task was organized as part of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA’24), held in conjunction with NAACL 2024, and called upon the research community to contribute solutions to the problem of modeling difficulty and response time for clinical multiple-choice questions (MCQs). A set of 667 previously used and now retired MCQs from the United States Medical Licensing Examination (USMLE®) and their corresponding difficulties and mean response times were made available for experimentation. A total of 17 teams submitted solutions and 12 teams submitted system report papers describing their approaches. This paper summarizes the findings from the shared task and analyzes the main approaches proposed by the participants.
2021
Edge: Enriching Knowledge Graph Embeddings with External Text
Saed Rezayi | Handong Zhao | Sungchul Kim | Ryan Rossi | Nedim Lipka | Sheng Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Knowledge graphs suffer from sparsity, which degrades the quality of representations generated by various methods. While there is an abundance of textual information throughout the web and many existing knowledge bases, aligning information across these diverse data sources remains a challenge in the literature. Previous work has partially addressed this issue by enriching knowledge graph entities based on “hard” co-occurrence of words present in the entities of the knowledge graphs and external text, while we achieve “soft” augmentation by proposing a knowledge graph enrichment and embedding framework named Edge. Given an original knowledge graph, we first generate a rich but noisy augmented graph using external texts at the semantic and structural levels. To distill the relevant knowledge and suppress the introduced noise, we design a graph alignment term in a shared embedding space between the original graph and the augmented graph. To enhance the embedding learning on the augmented graph, we further regularize the locality relationship of the target entity based on negative sampling. Experimental results on four benchmark datasets demonstrate the robustness and effectiveness of Edge in link prediction and node classification.