Linhao Luo
2026
Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing
Shigeng Chen | Linhao Luo | Zhangchi Qiu | Yanan Cao | Carl Yang | Shirui Pan
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Knowledge editing (KE) has recently emerged as a promising technique to update specific facts in large language models (LLMs) without full retraining. While existing KE methods show promising results on general-domain benchmarks, their effectiveness in the medical domain remains largely unexplored. Medical knowledge editing poses unique challenges, requiring models not only to memorize new facts but also to internalize and generalize them for reliable and interpretable clinical decision-making. In this work, we propose MedEditBench, a rigorous evaluation framework for assessing medical knowledge editing. Our preliminary results reveal that the current KE paradigm, which directly edits simple answers into LLMs, often leads to superficial updates with poor generalization. To address this, we introduce Self-Generated Rationale Editing (SGR-Edit), which leverages model-generated rationales as editing targets, enabling deeper knowledge integration. Extensive experiments across diverse LLMs and KE methods demonstrate that SGR-Edit consistently improves editing efficacy and generalization. Furthermore, we examine the impact of sequential edits on in-domain medical knowledge, external-domain knowledge, and general model capabilities, offering practical insights for deploying KE in real-world medical applications.
2025
Continual Learning of Large Language Models
Tongtong Wu | Trang Vu | Linhao Luo | Gholamreza Haffari
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
As large language models (LLMs) continue to expand in size and utility, keeping them current with evolving knowledge and shifting user preferences becomes an increasingly urgent yet challenging task. This tutorial offers a comprehensive exploration of continual learning (CL) in the context of LLMs, presenting a structured framework that spans continual pre-training, instruction tuning, and alignment. Grounded in recent survey work and empirical studies, we discuss emerging trends, key methods, and practical insights from both academic research and industry deployments. In addition, we highlight the new frontier of lifelong LLM agents, i.e., systems capable of autonomous, self-reflective, and tool-augmented adaptation. Participants will gain a deep understanding of the computational, algorithmic, and ethical challenges inherent to CL in LLMs, and learn about strategies to mitigate forgetting, manage data and evaluation pipelines, and design systems that can adapt responsibly and reliably over time. This tutorial will benefit researchers and practitioners interested in advancing the long-term effectiveness, adaptability, and safety of foundation models.
2024
RENOVI: A Benchmark Towards Remediating Norm Violations in Socio-Cultural Conversations
Haolan Zhan | Zhuang Li | Xiaoxi Kang | Tao Feng | Yuncheng Hua | Lizhen Qu | Yi Ying | Mei Rianto Chandra | Kelly Rosalin | Jureynolds Jureynolds | Suraj Sharma | Shilin Qu | Linhao Luo | Ingrid Zukerman | Lay-Ki Soon | Zhaleh Semnani Azad | Reza Haf
Findings of the Association for Computational Linguistics: NAACL 2024
Norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. Remediating norm violations requires social awareness and cultural sensitivity to the nuances at play. To equip interactive AI systems with a remediation ability, we offer ReNoVi, a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, and define a sequence of tasks to help understand and remediate norm violations step by step. ReNoVi consists of two parts: 512 human-authored dialogues (real data) and 8,746 synthetic conversations generated by ChatGPT through prompt learning. While collecting sufficient human-authored data is costly, synthetic conversations provide a suitable amount of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between LLMs and humans in the awareness of social norms. We thus harness the power of ChatGPT to generate synthetic training data for our task. To ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. Our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data.
Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs
Minh-Vuong Nguyen | Linhao Luo | Fatemeh Shiri | Dinh Phung | Yuan-Fang Li | Thuy-Trang Vu | Gholamreza Haffari
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) have demonstrated strong reasoning abilities when prompted to generate chain-of-thought (CoT) explanations alongside answers. However, previous research on evaluating LLMs has solely focused on answer accuracy, neglecting the correctness of the generated CoT. In this paper, we delve deeper into the CoT reasoning capabilities of LLMs in multi-hop question answering by utilizing knowledge graphs (KGs). We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs’ knowledge of reasoning and the accuracy of the generated CoT. Through experiments conducted on 5 different families of LLMs across 2 multi-hop question-answering datasets, we find that LLMs possess sufficient knowledge to perform reasoning. However, there exists a significant disparity between answer accuracy and faithfulness of the CoT generated by LLMs, indicating that they often arrive at correct answers through incorrect reasoning.
2023
Systematic Assessment of Factual Knowledge in Large Language Models
Linhao Luo | Trang Vu | Dinh Phung | Reza Haf
Findings of the Association for Computational Linguistics: EMNLP 2023
Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in large language models (LLMs). However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains which may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate the state-of-the-art LLMs with KGs in generic and specific domains. The experiments show that ChatGPT is consistently the top performer across all domains. We also find that LLMs' performance depends on instruction finetuning, domain, and question complexity, and is prone to adversarial context.
Co-authors
- Reza Haf 2
- Gholamreza Haffari 2
- Dinh Phung 2
- Trang Vu 2
- Yanan Cao 1
- Mei Rianto Chandra 1
- Shigeng Chen 1
- Tao Feng 1
- Yuncheng Hua 1
- Jureynolds Jureynolds 1
- Xiaoxi Kang 1
- Zhuang Li 1
- Yuan-Fang Li 1
- Minh-Vuong Nguyen 1
- Shirui Pan 1
- Zhangchi Qiu 1
- Lizhen Qu 1
- Shilin Qu 1
- Kelly Rosalin 1
- Zhaleh Semnani Azad 1
- Suraj Sharma 1
- Fatemeh Shiri 1
- Lay-Ki Soon 1
- Thuy Vu 1
- Tongtong Wu 1
- Carl Yang 1
- Yi Ying 1
- Haolan Zhan 1
- Ingrid Zukerman 1