Yutian Zhao


2024

pdf
imapScore: Medical Fact Evaluation Made Easy
Huimin Wang | Yutian Zhao | Xian Wu | Yefeng Zheng
Findings of the Association for Computational Linguistics: ACL 2024

Automatic evaluation of natural language generation (NLG) tasks has gained extensive research interests, since it can rapidly assess the performance of large language models (LLMs). However, automatic NLG evaluation struggles with medical QA because it fails to focus on the crucial correctness of medical facts throughout the generated text. To address this, this paper introduces a new data structure, imap, designed to capture key information in questions and answers, enabling evaluators to focus on essential details. The imap comprises three components: Query, Constraint, and Inform, each of which is in the form of term-value pairs to represent medical facts in a structural manner. We then introduce imapScore, which compares the corresponding medical term-value pairs in the imap to score generated texts. We utilize GPT-4 to extract imap from questions, human-annotated answers, and generated responses. To mitigate the diversity in medical terminology for fair term-value pairs comparison, we use a medical knowledge graph to assist GPT-4 in determining matches. To compare imapScore with existing NLG metrics, we establish a new benchmark dataset. The experimental results show that imapScore consistently outperforms state-of-the-art metrics, demonstrating an average improvement of 79.8% in correlation with human scores. Furthermore, incorporating imap into n-gram, embedding, and LLM metrics boosts the base versions, increasing correlation with human scores by averages of 89.9%, 81.7%, and 32.6%, respectively.

pdf
Can LLMs Replace Clinical Doctors? Exploring Bias in Disease Diagnosis by Large Language Models
Yutian Zhao | Huimin Wang | Yuqi Liu | Wu Suhuang | Xian Wu | Yefeng Zheng
Findings of the Association for Computational Linguistics: EMNLP 2024

The bias of disease prediction in Large Language Models (LLMs) is a critical yet underexplored issue, with potential implications for healthcare outcomes and equity. As LLMs increasingly find applications in healthcare, understanding and addressing their biases becomes paramount. This study focuses on this crucial topic, investigating the bias of disease prediction in models such as GPT-4, ChatGPT, and Qwen1.5-72b across gender, age range, and disease judgment behaviors. Utilizing a comprehensive real-clinical health record dataset of over 330,000 entries, we uncover that all three models exhibit distinct biases, indicating a pervasive issue of unfairness. To measure this, we introduce a novel metric–the diagnosis bias score, which reflects the ratio of prediction numbers to label numbers. Our in-depth analysis, based on this score, sheds light on the inherent biases in these models. In response to these findings, we propose a simple yet effective prompt-based solution to alleviate the observed bias in disease prediction with LLMs. This research underscores the importance of fairness in AI, particularly in healthcare applications, and offers a practical approach to enhance the equity of disease prediction models.

pdf
MKeCL: Medical Knowledge-Enhanced Contrastive Learning for Few-shot Disease Diagnosis
Yutian Zhao | Huimin Wang | Xian Wu | Yefeng Zheng
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Artificial intelligence (AI)-aided disease prediction has gained extensive research interest due to its capability to support clinical decision-making. Existing works mainly formulate disease prediction as a multi-label classification problem and use historical Electronic Medical Records (EMR) to train supervised models. However, in real-world clinics, such purely data-driven approaches pose two main challenges: 1) long tail problem: there are excessive EMRs for common diseases and insufficient EMRs for rare diseases, thus training over an imbalanced data set could result in a biased model that ignores rare diseases in diagnosis; 2) easily misdiagnosed diseases: some diseases can be easily distinguished while others sharing analogous conditions are much more difficult. General classification models without emphasizing easily misdiagnosed diseases may generate incorrect predictions. To tackle these two problems, we propose a Medical Knowledge-Enhanced Contrastive Learning (MKeCL) approach to disease diagnosis in this paper. MKeCL incorporates medical knowledge graphs and medical licensing exams in modeling in order to compensate for the insufficient information on rare diseases; To handle hard-to-diagnose diseases, MKeCL introduces a contrastive learning strategy to separate diseases that are easily misdiagnosed. Moreover, we establish a new benchmark, named Jarvis-D, which contains clinical EMRs collected from various hospitals. Experiments on real clinical EMRs show that the proposed MKeCL outperforms existing disease prediction approaches, especially in the setting of few-shot and zero-shot scenarios.