Weiyan Zhang
2026
MedKInstruct: A Multimodal Knowledge Graph Based Framework for Multi-Hop and Hard-Negative Instruction Data Synthesis in MedVQA
Yinan Wu | Jihang Jin | Xuhao Bao | Weiyan Zhang | Hanjing Yan | Tong Ruan | ChunMing Wang
Findings of the Association for Computational Linguistics: ACL 2026
Medical visual question answering (MedVQA) requires models to provide accurate answers given a medical image and a corresponding question. Recently, instruction tuning of general large vision–language models (LVLMs) has become a dominant paradigm for this task, enabling open-ended predictions and effective integration of multimodal information. However, existing methods synthesize instruction data from image–caption pairs that primarily focus on visual attributes, rather than knowledge-level QA generation. This situation limits the model’s ability to learn relevant medical knowledge during training, thereby restricting its performance on MedVQA. Hence, this paper proposes MedKInstruct, which incorporates a multimodal medical knowledge graph (MMKG) to assist LVLMs in synthesizing knowledge-intensive instruction data. Additionally, we design an MMKG path–based reward function to train a stronger MedVQA model through reinforcement learning. Experimental results on the public datasets Slake and VQA-RAD show that MedKInstruct outperforms previous methods by 4.16% and 4.50%, respectively. The source code is available at the following link: https://github.com/Sonder-hang/MedKinstruct
PlanE: Meta Planning of Data, Tuning, and Inference for Extractive-based LLMs
Jiacheng Wang | Weiyan Zhang | Guangya Yu
Findings of the Association for Computational Linguistics: ACL 2026
Enhancing the task-specific capabilities of Large Language Models (LLMs) primarily requires substantial instruction-tuning datasets. However, the sheer volume of such data imposes a considerable annotation cost, and a lack of optimization methods for tailoring LLMs to specific tasks persists. To address these issues, we propose a Planning framework for constructing Extractive-based LLMs called PlanE, which includes data decomposition, instruction tuning, and prompt inference. Additionally, we introduce a Data-Tuning-Inference (DTI) planner, aimed at selecting the optimal base-LLM and its DTI combinations for specific datasets to improve construction efficiency. The experimental results demonstrate the effectiveness of PlanE from two views: (1) across different datasets using the same base-LLM, and (2) on the same dataset using different base-LLMs. Furthermore, we validate the generalizability of the proposed DTI planner under different optimization objectives. The code is publicly available at https://github.com/gugugu-469/PlanE.
Experience is the Teacher: Reusing Atomic Thoughts from LLMs to Improve Medical Dialogue
Guangya Yu | Hui Luo | Qi Ye | Ruihui Hou | Weiyan Zhang | Mingxi Shang | Xuanwu Li | ChunMing Wang | Tong Ruan
Findings of the Association for Computational Linguistics: ACL 2026
Despite the remarkable performance of large language models (LLMs) in medicine, particularly their ability to support clinical decision-making in medical dialogues, a key limitation remains: the static reasoning patterns derived from human expert experience are often inadequate for the dynamic and diverse nature of real-world multi-turn conversations. While recent large reasoning models (such as R1) enable deeper and more complex thought processes to address such challenges, they also introduce significant redundancy. Meanwhile, recent studies on reusing atomic thoughts demonstrate a practical pathway toward dynamic and precise reasoning in general domains. In this paper, we investigate the role of atomic thought-based experience in medical dialogue tasks. First, we collect human expert clinical experience. Then, we propose a novel distillation framework that extracts atomic thoughts from teacher models and reuses them to guide reasoning and generate responses. Based on this framework, we construct training data from ReMeDi and fine-tune student models, which demonstrate enhanced performance in both static and interactive medical dialogue scenarios. Furthermore, we examine the impact of experience across various models, datasets, and scenarios. Crucially, transferring this experience empowers weaker models to generate high-quality reasoning data, matching the annotation capabilities of stronger LLMs while significantly reducing costs. The code is available in this repository: https://github.com/VioletAmethystLunar/Atomic-Thoughts-Medical-Dialogue.
Balancing Knowledge Breadth and Task Depth for Effective Domain Adaptation Fine-Tuning
Mu Zhang | Yuxiang Chu | Guangya Yu | Yongqi Fan | Weiyan Zhang | Hang Hu | Tong Ruan | Jingping Liu
Findings of the Association for Computational Linguistics: ACL 2026
Training large language models for domain adaptation poses a significant challenge in balancing the acquisition of domain knowledge with the retention of general abilities, often leading to catastrophic forgetting. While curriculum learning offers a promising direction, conventional methods typically rely on a single dimension of knowledge or task, which is insufficient to navigate the trade-off between knowledge breadth and task depth. In this paper, we propose a two-dimensional curriculum learning framework that coordinates model training along two orthogonal axes: the knowledge dimension and the task dimension. We first reconstruct the dataset by clustering instances according to their semantic similarity to general-domain data, and subsequently annotate them with a task hierarchy. Then, we design an integrated curriculum that develops from general to domain-specific knowledge clusters, and within each cluster, from lower- to higher-order cognitive tasks. Compared with the second-best method, our method improves accuracy on medical evaluations by 2.49% and on financial evaluations by 1.2%. Ablation and cross-domain experiments further demonstrate our method as a scalable and effective framework for structured domain adaptation in large language model fine-tuning. We have released the code in an anonymous repository at https://github.com/Melo-1017/Balancing-Knowledge-Breadth-and-Task-Depth.
2025
Can Multimodal Large Language Models Understand Spatial Relations?
Jingping Liu | Ziyan Liu | Zhedong Cen | Yan Zhou | Yinan Zou | Weiyan Zhang | Haiyun Jiang | Tong Ruan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model’s prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented, and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://huggingface.co/datasets/liuziyan/SpatialMQA.
LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review
Cheng Yuan | Xinkai Rui | Yongqi Fan | Yawei Fan | Boyang Zhong | Jiacheng Wang | Weiyan Zhang | Tong Ruan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from generating inaccurate content or fabricating information without valid sources. To address these issues, we propose LCDS, a tool for empowering LLMs with Logic-Controlled Discharge Summary generation. LCDS constructs a source mapping table by calculating the textual similarity between electronic medical records (EMRs) and discharge summaries, providing a structured reference for generation. Based on a comprehensive set of logical rules, LCDS identifies the structured writing logic of discharge summaries and integrates it with EMRs to generate silver discharge summaries. Furthermore, LCDS traces the provenance of generated content, allowing experts to review, provide feedback, and rectify errors to produce golden discharge summaries, which are subsequently recorded for the incremental fine-tuning of LLMs. Our project and demo video are available in the GitHub repository https://github.com/ycycyc02/LCDS.
CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
Guangya Yu | Yanhao Li | Zongying Jiang | Yuxiong Jin | Li Dai | Yupian Lin | Ruihui Hou | Weiyan Zhang | Yongqi Fan | Qi Ye | Jingping Liu | Tong Ruan
Findings of the Association for Computational Linguistics: ACL 2025
Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task, MQCIC, and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to further improve performance in MQCIC. The dataset and code are available in this repository: https://github.com/YuY-2001/C-MQCIC.
2024
Unexpected Phenomenon: LLMs’ Spurious Associations in Information Extraction
Weiyan Zhang | Wanpeng Lu | Jiacheng Wang | Yating Wang | Lihan Chen | Haiyun Jiang | Jingping Liu | Tong Ruan
Findings of the Association for Computational Linguistics: ACL 2024
Information extraction plays a critical role in natural language processing. When applying large language models (LLMs) to this domain, we discover an unexpected phenomenon: LLMs’ spurious associations. In tasks such as relation extraction, LLMs can accurately identify entity pairs, even if the given relation (label) is semantically unrelated to the pre-defined original one. To find these labels, we design two strategies in this study, including forward label extension and backward label validation. We also leverage the extended labels to improve model performance. Our comprehensive experiments show that spurious associations occur consistently in both Chinese and English datasets across various LLM sizes. Moreover, the use of extended labels significantly enhances LLM performance in information extraction tasks. Remarkably, there is a performance increase of 9.55%, 11.42%, and 21.27% in F1 scores on the SciERC, ACE05, and DuEE datasets, respectively.
Co-authors
- Tong Ruan 7
- Jingping Liu 4
- Guangya Yu 4
- Yongqi Fan 3
- Jiacheng Wang 3
- Ruihui Hou 2
- Haiyun Jiang 2
- ChunMing Wang 2
- Qi Ye 2
- Xuhao Bao 1
- Zhedong Cen 1
- Lihan Chen 1
- Yuxiang Chu 1
- Li Dai 1
- Yawei Fan 1
- Hang Hu 1
- Zongying Jiang 1
- Jihang Jin 1
- Yuxiong Jin 1
- Yanhao Li 1
- Xuanwu Li 1
- Yupian Lin 1
- Ziyan Liu 1
- Wanpeng Lu 1
- Hui Luo 1
- Xinkai Rui 1
- Mingxi Shang 1
- Yating Wang 1
- Yinan Wu 1
- Hanjing Yan 1
- Cheng Yuan 1
- Mu Zhang 1
- Boyang Zhong 1
- Yan Zhou 1
- Yinan Zou 1