Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that combining detailed chain-of-thought (CoT) reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. The larger ReasonMed-14B remains highly competitive, underscoring consistent scaling potential. The code and datasets are available at https://github.com/YuSun-Work/ReasonMed.
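To make the generate-verify-refine loop concrete, here is a minimal Python sketch of such a curation process. The agent interfaces (`generate`, `verify`, `refine`) and the `max_rounds` cutoff are hypothetical placeholders under assumed semantics, not the actual ReasonMed pipeline.

```python
# Hypothetical sketch of a multi-agent generate-verify-refine loop.
# The agent callables below are placeholders, not the ReasonMed code.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    ok: bool
    flawed_steps: List[int]  # indices of error-prone reasoning steps


def curate_path(
    question: str,
    generate: Callable[[str], List[str]],                 # generator agent: question -> reasoning steps
    verify: Callable[[List[str]], Verdict],               # verifier agent: steps -> verdict
    refine: Callable[[List[str], List[int]], List[str]],  # error refiner: fix flagged steps
    max_rounds: int = 3,
) -> Optional[List[str]]:
    """Return a verified reasoning path, or None if refinement never succeeds."""
    steps = generate(question)
    for _ in range(max_rounds):
        verdict = verify(steps)
        if verdict.ok:
            return steps  # keep only paths that pass verification
        steps = refine(steps, verdict.flawed_steps)
    return None  # discard paths that never pass
```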
The conceptual knowledge in Large Language Models (LLMs) can become outdated over time, and concept editing offers a way to keep it current. Current evaluations of conceptual knowledge editing focus primarily on whether the definitions of concepts are successfully edited, neglecting the impact on the model's related beliefs. To address this gap, we introduce RelEdit, a benchmark with criteria and questions for assessing both the concept-level and instance-level relational reasoning abilities of edited models. Our findings reveal that existing knowledge editing methods struggle to reason effectively about related conceptual knowledge. We also introduce a simple memory-based in-context editing baseline, MICE, which prompts the language model to generate answers consistent with the edited concepts stored in external memory. MICE obtains the best scores on our benchmark, suggesting a promising direction for model editing research.
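As an illustration of the memory-based in-context editing idea, the sketch below stores edited concept definitions in an external memory and prepends any retrieved edits to the prompt. The `EditMemory` class, the naive keyword retrieval, the prompt wording, and the `llm` callable are all assumptions for illustration, not the paper's actual MICE implementation.

```python
# Hypothetical sketch of memory-based in-context editing (MICE-style).
# `llm` stands in for any text-generation function; retrieval here is
# naive substring matching, chosen only to keep the example self-contained.
from typing import Callable, Dict


class EditMemory:
    def __init__(self) -> None:
        self.edits: Dict[str, str] = {}  # concept -> edited definition

    def add(self, concept: str, definition: str) -> None:
        self.edits[concept] = definition

    def retrieve(self, question: str) -> str:
        # Return all stored edits whose concept name appears in the question.
        hits = [f"{c}: {d}" for c, d in self.edits.items()
                if c.lower() in question.lower()]
        return "\n".join(hits)


def answer(question: str, memory: EditMemory, llm: Callable[[str], str]) -> str:
    context = memory.retrieve(question)
    if context:
        # Prompt the model to answer consistently with the stored edits.
        prompt = (f"Updated facts:\n{context}\n\n"
                  f"Answer consistently with the updated facts.\n"
                  f"Q: {question}\nA:")
    else:
        prompt = f"Q: {question}\nA:"
    return llm(prompt)
```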