Hongjian Zhou


2025

DrAgent: Empowering Large Language Models as Medical Agents for Multi-hop Medical Reasoning
Fenglin Liu | Zheng Li | Hongjian Zhou | Qingyu Yin | Jingfeng Yang | Xin Liu | Zhengyang Wang | Xianfeng Tang | Shiyang Li | Xiang He | Ruijie Wang | Bing Yin | Xiao Gu | Lei Clifton | David A. Clifton
Findings of the Association for Computational Linguistics: EMNLP 2025

Although large language models (LLMs) have been shown to outperform human experts in medical examinations, it remains challenging to adopt LLMs in real-world clinical decision-making, which typically involves multi-hop medical reasoning. Common practices include prompting commercial LLMs and fine-tuning LLMs on medical data. However, in the clinical domain, using commercial LLMs raises privacy concerns regarding sensitive patient data. Fine-tuning competitive medical LLMs for different tasks usually requires extensive data and computing resources, which are difficult to acquire, especially in medical institutions with limited infrastructure. We propose DrAgent, which builds LLMs into agents that deliver accurate medical decision-making and reasoning. In implementation, we take a lightweight LLM as the backbone and have it collaborate with diverse clinical tools. To make efficient use of data, DrAgent introduces recursive curriculum learning to optimize the LLM in an easy-to-hard progression. The results show that our approach achieves competitive performance on diverse datasets.
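The abstract names an easy-to-hard curriculum but not its exact schedule. As a rough sketch of that general idea (not the paper's actual recursive procedure), the Python below orders training examples by a hypothetical difficulty proxy, the number of reasoning hops, and trains over cumulative easy-to-hard stages; all names here (`Example`, `hops`, `model_update`) are illustrative assumptions.

```python
# Minimal sketch of easy-to-hard curriculum ordering. Assumption: "hops" as a
# difficulty proxy and the staging scheme are illustrative, not DrAgent's method.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    question: str
    answer: str
    hops: int  # proxy for difficulty: number of reasoning hops


def curriculum_stages(data: List[Example], n_stages: int) -> List[List[Example]]:
    """Sort examples easy-to-hard and split into cumulative stages."""
    ordered = sorted(data, key=lambda ex: ex.hops)
    bounds = [round(len(ordered) * (i + 1) / n_stages) for i in range(n_stages)]
    # Each stage extends the previous one with harder examples.
    return [ordered[:b] for b in bounds]


def train(model_update: Callable[[Example], None],
          data: List[Example], n_stages: int = 3) -> None:
    for batch in curriculum_stages(data, n_stages):
        for ex in batch:
            model_update(ex)  # placeholder for one fine-tuning step
```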

2024

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark
Fenglin Liu | Zheng Li | Hongjian Zhou | Qingyu Yin | Jingfeng Yang | Xianfeng Tang | Chen Luo | Ming Zeng | Haoming Jiang | Yifan Gao | Priyanka Nigam | Sreyashi Nag | Bing Yin | Yining Hua | Xuan Zhou | Omid Rohanian | Anshul Thakur | Lei Clifton | David A. Clifton
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the closed-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark, ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.
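As a rough illustration of the zero-shot versus few-shot settings mentioned above (a sketch under assumptions, not ClinicBench's actual evaluation harness), the Python below builds open-ended prompts with or without worked demonstrations; the prompt template and the `generate` callable are hypothetical.

```python
# Minimal sketch of zero-shot vs. few-shot prompting for open-ended clinical QA.
# Assumption: the template and generate() interface are illustrative only.
from typing import Callable, List, Sequence, Tuple


def build_prompt(question: str, demos: Sequence[Tuple[str, str]] = ()) -> str:
    """Zero-shot when demos is empty; few-shot prepends worked examples."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def evaluate(generate: Callable[[str], str], questions: List[str],
             demos: Sequence[Tuple[str, str]] = ()) -> List[str]:
    # Open-ended setting: no answer options, the model generates free text.
    return [generate(build_prompt(q, demos)) for q in questions]
```

Passing an empty `demos` sequence yields the zero-shot prompt; supplying a few (question, answer) pairs yields the few-shot variant of the same evaluation loop.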