Kang Liu
2026
GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks
Jianwen Luo | Yiming Huang | Jinxiang Meng | Fangyu Lei | Shizhu He | Xiao Liu | Shanshan Jiang | Bin Dong | Jun Zhao | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to efficiently construct reliable toolsets and are limited to single-task settings. To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, TabMWP). Our results show that GATE achieves up to 4.3× faster milestone completion in Minecraft compared to the previous state-of-the-art method, and provides an average improvement of 9.23% over existing tool-making methods in code generation tasks and 10.03% in agent tasks. Further analysis shows that GATE exhibits strong adaptive evolution capabilities, effectively balancing tool quantity, complexity, and functionality while maintaining high efficiency. Code and data are available at https://github.com/ayanami2003/GATE.
Reinforcing Agentic Search Via Reward Density Optimization
Kun Luo | Hongjin Qian | Zheng Liu | Ziyi Xia | Shitao Xiao | Zhao Cao | Siqi Bao | Jun Zhao | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic search. However, its performance is often hindered by reward sparsity, whereby agents receive very limited positive feedback despite incurring significant exploration costs. In this paper, we formalize this challenge as a new research problem termed **Reward Density Optimization**, which aims to improve the reward obtained per unit of exploration cost. To address this problem, we introduce InfoFlow, a systematic framework that operates along three complementary dimensions: 1) **Sub-goal Scaffolding**: which decomposes long-horizon tasks into intermediate objectives and assigns process-level rewards to provide denser learning signals; 2) **Pathfinding Hints**: which injects corrective guidance into stalled trajectories to increase the ratio of successful trials; and 3) **Dual-agent Refinement**: which employs a dual-agent architecture to offload the cognitive burden of deep exploration. We evaluate InfoFlow on several popular agentic search benchmarks, where it significantly outperforms strong baselines and enables lightweight LLMs to achieve performance comparable to that of advanced proprietary models.
Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do
Zhuoran Jin | Kejian Zhu | Hongbang Yuan | Yupu Hao | Pengfei Cao | Yubo Chen | Kang Liu | Jun Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks remains unclear. In this paper, we aim to systematically investigate the key question: What can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 12 multimodal tasks across perception and reasoning categories using 14 non-reasoning models and 8 reasoning models. Our analysis reveals several important findings: (1) CoT is not a free lunch and should be used selectively depending on the specific requirements of each task. For perception tasks, CoT can lead to undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective for reasoning tasks involving mathematical, scientific, and multi-image reasoning; (2) Compared to original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly due to an overemphasis on mathematical reasoning at the expense of broader capabilities; (3) Visual reasoning remains a key bottleneck for current multimodal CoT, as models exhibit a “Look Light, Think Heavy” pattern where verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.
Spectral Disentanglement: Rank-Aware Task Adaptation for Rehearsal-free Continual Learning in LLMs
Huanxuan Liao | Shizhu He | Yupu Hao | Yequan Wang | Wenhao Teng | Xiangwen Liao | Jun Zhao | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Continual Learning (CL) for Large Language Models (LLMs) faces a fundamental Stability-Plasticity Dilemma: balancing the plasticity to acquire new capabilities with the stability to preserve prior knowledge. While Parameter-Efficient Fine-Tuning methods, such as LoRA, enable efficient adaptation, we identify a critical flaw in current approaches termed Rank-Blindness: the enforcement of a single rank constraint across diverse tasks, which entangles task-shared and task-specific knowledge, leading to catastrophic forgetting of earlier tasks and underfitting on complex new ones. To address this, we propose SpaRTA, a novel rehearsal-free framework guided by a rank-spectrum perspective that explicitly disentangles knowledge into two orthogonal subspaces. Specifically, SpaRTA employs a low-rank branch to capture task-shared representations and a high-rank branch to model task-specific features. To integrate these complementary representations, we introduce a context-aware dynamic router that adaptively fuses the two branches based on input semantics, while an explicit orthogonality constraint minimizes interference between shared and specific parameter subspaces. This design effectively isolates task-specific updates from shared knowledge, preventing the overwriting of prior capabilities while preserving strong adaptation capacity. Extensive experiments demonstrate that SpaRTA achieves a superior stability-plasticity balance compared to single-rank baselines. Notably, the proposed spectral disentanglement strategy substantially reduces inter-task interference and yields strong zero-shot generalization on unseen tasks. Our code will be available at https://github.com/Xnhyacinth/SpaRTA.
Theory-optimal Quantization Based on Flatness
Xiusheng Huang | Zhe Li | Xuanwu Yin | Lu Wang | Yequan Wang | Dong Li | Emad Barsoum | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretically optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that effectively disperses outlier patterns through optimized matrix transformations. BDQ strategically distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1% on the DeepSeek-R1-Distill-LLaMA-70B model.
Towards Explainable Diagnosis: A Self-learned Explanatory Knowledge Base Approach
Dongqi Huang | Tong Zhou | Zhuoran Jin | Shenghui Shi | Maoyujiao | Kang Liu | Jun Zhao | Yubo Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Explainable diagnosis requires that authoritative medical knowledge provide the rationales linking a patient’s clinical manifestations to the diagnostic conclusion. Although large language models (LLMs) hold great potential to facilitate explainable diagnosis, their effectiveness is often constrained by insufficient diagnostic expertise. To address this limitation, we propose Self-learned Explainable Knowledge Augmented Diagnosis (SEKAD), a unified LLM-based framework for faithful and explainable diagnosis. Our approach builds a high-quality diagnostic knowledge base through a record-driven explanation learning paradigm and applies this knowledge via an explanation-based diagnostic process that ensures faithful inference. Experiments on the DiReCT and JAMA benchmarks show that SEKAD consistently outperforms strong baselines across the metrics. In particular, on the DiReCT benchmark, SEKAD improves the explanation completeness metric from 64.5% to 76.9% over the best existing methods, highlighting its effectiveness in enhancing diagnostic explainability and showing that our text mining approach produces knowledge that is both reliable in quality and large in quantity.
Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning
Tianyi Men | Zhuoran Jin | Pengfei Cao | Yubo Chen | Kang Liu | Jun Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open-source MLLMs are cost-efficient and privacy-preserving compared with commercial large models, they suffer from weak planning and limited cross-website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high-level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities: low, middle, and high levels. Our analysis reveals that mastering low-level atomic skills does not guarantee high-level planning competence, while high-level task training yields stronger OOD generalization. Experiments on real-world benchmarks demonstrate PEEU’s superior effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These results demonstrate that constructing hindsight high-level tasks and leveraging experiences are crucial for the OOD planning abilities of small MLLMs.
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
Siqi Fan | Xiusheng Huang | Yiqun Yao | Xuezhi Fang | Kang Liu | Peng Han | Shuo Shang | Aixin Sun | Yequan Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LifeState-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets (Hamlet and a synthetic script collection) that are rich in narrative structure and character interactions. Our fact-checking evaluation probes models’ self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.
Hetero-Designer: Automated Design of Multi-Agent Systems with Heterogeneous LLMs
Zhiheng Zhang | Yuanzhe Zhang | Bohan Yu | Daojian Zeng | Kang Liu | Jun Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLM-based multi-agent systems (MAS) have shown strong capabilities across a wide range of domains. Their success largely hinges on the collaboration topology design, which has emerged as a central research focus in automated MAS design. However, existing approaches are fundamentally constrained by their reliance on homogeneous LLMs, which significantly limits overall system intelligence. In response to this limitation, we propose, for the first time, the concept of **Automated Design of Heterogeneous-LLMs-based MAS (ADHM)**. ADHM sheds light on a promising avenue for advancing collective intelligence, focusing on the automated design of cost-effective MAS composed of diverse LLMs and roles to suit various queries. Toward this challenging goal, we propose **Hetero-Designer**, a novel pipeline that efficiently encodes intricate dependencies among queries, LLMs, and roles through a novel Binary-Star Transformer and constructs Hetero-MAS in an autoregressive graph generation process. Extensive experiments demonstrate that **Hetero-Designer** is: (1) high-performing on various benchmarks, (2) economical in reducing overhead, and (3) extensible to unseen LLMs and roles.
Shuttle Between Symbolic Instructions and Neural Parameters of Large Language Models
Wangtao Sun | Haotian Xu | Huanxuan Liao | Xuanqing Yu | Zhongtao Jiang | Shizhu He | Jun Zhao | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper observes that, although symbolic instructions and neural parameters play different roles in steering LLMs’ behavior, both are compressions of task data; they should therefore be strongly correlated, and each can be learned to predict the other. Therefore, this paper proposes a novel neural network framework, SHIP (Shuttle between the Instructions and the Parameters), to model and learn the bi-directional mappings between the instructions and the parameters of LLMs. We verify that SHIP can effectively map one of the instructions/parameters to the other by evaluating it on the tasks of instruction deduction and induction. The results show that SHIP performs better than existing baseline methods in terms of deductive capabilities while significantly surpassing them in inductive capabilities. Moreover, SHIP can effectively combine the two mapping processes to perform excellent inductive reasoning. We further discuss how the latent fusing methods and latent dimensions affect SHIP’s performance, and show that SHIP can effectively generalize with pre-training. The code and data for this paper are released at https://anonymous.4open.science/r/Shuttle-Between-Instructions-Parameters
Harmonizing the Past, Present, and Future: A Null-Space Constrained Region-Specific Method for Continual Learning in LLMs
Jinhui Chen | Shizhu He | Xingchang Yang | Huanxuan Liao | Yequan Wang | Xiangwen Liao | Wenhao Teng | Kang Liu | Jun Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Enabling Large Language Models (LLMs) to evolve sustainably requires simultaneously preserving previously acquired knowledge (Past), effectively acquiring new task-specific skills (Present), and reserving sufficient parameter capacity for subsequent adaptation (Future). However, existing continual learning (CL) paradigms often prioritize immediate performance through dense updates, leading to catastrophic forgetting and rapid exhaustion of model capacity. To harmonize these conflicting demands, we draw inspiration from the brain’s functional partitioning and propose the Null-Space Constrained Parameter Region Specificity Method (PaRSP). PaRSP establishes a dynamic "Task-Region Mapping" that distinguishes between specialized neurons and generalist neurons. By precisely localizing a sparse "functional core" for each task, PaRSP restricts updates to specific regions via null-space orthogonality, preserving the vast majority of the network as an immutable "long-term memory bank." This induced sparsity not only enhances plasticity via targeted adaptation and minimizes interference to ensure stability, but also strategically reserves substantial capacity, securing sustainability for future evolution. Extensive experiments validate PaRSP’s state-of-the-art performance, particularly on Standard CL and Long Sequence benchmarks, effectively harmonizing the stability-plasticity-sustainability trade-off. Code is available at https://github.com/JinhuiBot/PaRSP
Efficient Prior-Guided Reasoning for Robust Retrieval-Augmented Generation under Conflicts
Xiaowei Yuan | Ziyang Huang | Zhao Yang | Yequan Wang | Jun Zhao | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) has become a standard paradigm for grounding Large Language Models (LLMs) with external knowledge. However, RAG performance often degrades substantially when faced with noisy, outdated, or conflicting retrieved information. In this work, we empirically demonstrate that Prior-Guided Reasoning—a strategy that explicitly elicits the model’s parametric knowledge as prior information to guide reasoning on retrieved documents—effectively mitigates the impact of external conflicts. Building on this, we propose BrPr (Bernoulli-gated reinforcement learning for Prior-Guided reasoning), a framework that achieves robust performance across varying degrees of external inconsistency. Furthermore, by employing a Bernoulli-gated dropout mechanism during training, BrPr distills the prior-driven reasoning capability into the model parameters, enabling efficient latent reasoning without explicit prior generation. The experimental results demonstrate that BrPr consistently exhibits superior robustness to external conflicts and noise.
R³A: Reinforced Reasoning for Relevance Assessment for RAG in User-Generated Content Platforms
Xiaowei Yuan | Lei Jin | Haoxin Zhang | Ziyang Huang | Yan Gao | Yiwu | Yao Hu | Jun Zhao | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Retrieval-augmented generation (RAG) plays a critical role in user-generated content (UGC) platforms, but its effectiveness critically depends on accurate query–document relevance assessment. Despite recent advances in applying large language models (LLMs) to relevance modeling, UGC platforms present unique challenges: 1) ambiguous user intent due to sparse user feedback in RAG scenarios, and 2) asymmetric relevance, where relevance is driven by localized answer-bearing content rather than global query–document similarity. To address these issues, we propose the Reinforced Reasoning model for Relevance Assessment (R³A), which decomposes relevance assessment into intent inference and evidence grounding. R³A leverages auxiliary high-clicked documents to infer latent query intent, and extracts verbatim evidence fragments to ground relevance decisions, reducing noise sensitivity and improving asymmetric relevance modeling. Experimental results demonstrate that R³A substantially outperforms strong baselines on offline benchmarks, while the distilled R³A-1.5B model achieves significant gains in large-scale online A/B testing, effectively balancing performance and practical deployability.
2025
Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing
Cunli Mao | Xiaofei Gao | Ran Song | Shizhu He | Shengxiang Gao | Kang Liu | Zhengtao Yu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language model (LLM)-based Multilingual Knowledge Graph Completion (MKGC) aims to predict missing facts by leveraging LLMs’ multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs). However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge. In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization. To evaluate our framework, we constructed an mKG dataset containing 5 languages and conducted comprehensive comparative experiments with the existing state-of-the-art (SOTA) MKGC method. The experimental results demonstrate that our framework achieves improvements of 5.47%, 3.27%, and 1.01% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with the SOTA MKGC method. Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages. We have released the dataset and code for our work at https://github.com/gaoxiaofei07/KL-GMoE.
MotivGraph-SoIQ: Integrating Motivational Knowledge Graphs and Socratic Dialogue for Enhanced LLM Ideation
Xinping Lei | Tong Zhou | Yubo Chen | Kang Liu | Jun Zhao
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) hold significant promise for accelerating academic ideation but face critical challenges in grounding ideas and mitigating confirmation bias during refinement. To address these limitations, we propose MotivGraph-SoIQ, a novel framework that enhances LLM ideation by integrating a Motivational Knowledge Graph (MotivGraph), which provides essential grounding from research literature, with a Q-Driven Socratic Ideator. The Ideator, a dual-agent system utilizing Socratic questioning, facilitates a rigorous refinement process that mitigates confirmation bias and significantly improves idea quality across dimensions of novelty, experimental feasibility, and motivation. Our experimental results demonstrate MotivGraph-SoIQ’s effectiveness. Comparative studies show significant quantitative improvements over SOTA methods across LLM-based scoring, ELO ranking, and human evaluation. Ablation studies further validate the crucial contributions of both the MotivGraph for enhancing idea novelty and practicality, and the Socratic dialogue with the teacher agent for substantial quality improvement. This work underscores the potential of combining structured knowledge with interactive, critique-based refinement for robust LLM ideation.
Why and How LLMs Benefit from Knowledge Introspection in Commonsense Reasoning
Chengfeng Zhao | Shizhu He | Shanshan Jiang | Bin Dong | Jun Zhao | Kang Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) can improve commonsense reasoning by generating intermediate knowledge. However, the effectiveness of this knowledge introspection is not always guaranteed. This paper first systematically investigates and reveals an **introspection paradox**: while simple introspection tends to benefit weaker models, it often degrades the performance of stronger ones, particularly on simpler tasks. Our in-depth analysis indicates that this paradox arises from a complex interplay among model capability, task difficulty, and the quality of generated knowledge. Further interpretability analysis reveals the origins of low-quality knowledge generation. To better employ introspected knowledge in LLMs, this paper proposes a training-free **Adaptive Introspection Strategy** that operates in two stages using only the model’s internal states: **Knowledge Detection**, which dynamically identifies and discards potentially low-quality knowledge, and **Knowledge Regeneration**, which employs attention smoothing to guide the model away from harmful failure modes during knowledge generation. Extensive experiments on five Llama models of different sizes and eight commonsense reasoning benchmarks demonstrate that our approach effectively mitigates the limitations of standard introspection and yields consistent performance gains across almost all settings.
M2Edit: Locate and Edit Multi-Granularity Knowledge in Multimodal Large Language Model
Yang Zhou | Pengfei Cao | Yubo Chen | Qingbin Liu | Dianbo Sui | Xi Chen | Kang Liu | Jun Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal knowledge editing is an important method for modifying outdated or incorrect knowledge in Multimodal Large Language Models (MLLMs). However, existing datasets for multimodal knowledge editing lack multi-granularity knowledge. In this paper, we present a more realistic dataset called M2Edit, which includes three distinct types of knowledge: entity, relation, and action. Additionally, existing knowledge editing methods for MLLMs lack the ability to handle multi-granularity knowledge and generalize to multimodal data. To address these limitations, we propose the multimodal knowledge editing method MLE. This approach identifies key knowledge layers within different components and collaboratively edits the various components of MLLMs. As a result, we observe significant improvements in visual generality performance, ranging from 4.8 to 10.8, and achieve the best overall performance on knowledge data of different granularities.
Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models
Wangtao Sun | ChenxiangZhang ChenxiangZhang | XueYou Zhang | Xuanqing Yu | Ziyang Huang | Haotian Xu | Shizhu He | Jun Zhao | Kang Liu
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
Although Large Language Models (LLMs) have demonstrated strong instruction-following ability, they are further supposed to be controlled and guided by inferential rules in real-world scenarios to be safe, accurate, and intelligent. This demands the possession of inferential rule-following capability of LLMs. However, no prior work has made a clear evaluation of the inferential rule-following capability of LLMs. Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into the improvements for LLMs toward a better inferential rule-following intelligent agent. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract inferential rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://gitee.com/forangel2014/llm-rule-following-code
Co-authors
- Jun Zhao 11
- Shizhu He (何世柱) 7
- Yequan Wang 5
- Jun Zhao 4
- Pengfei Cao (鹏飞 曹) 3
- Yubo Chen 3
- Ziyang Huang 3
- Zhuoran Jin 3
- Huanxuan Liao 3
- Yubo Chen 2
- Bin Dong 2
- Yupu Hao 2
- Xiusheng Huang 2
- Shanshan Jiang 2
- Xiangwen Liao 2
- Wangtao Sun 2
- Wenhao Teng 2
- Xuanqing Yu 2
- Xiaowei Yuan 2
- Tong Zhou 2
- Siqi Bao 1
- Emad Barsoum 1
- Zhao Cao 1
- Jinhui Chen 1
- Xi Chen 1
- ChenxiangZhang ChenxiangZhang 1
- Siqi Fan 1
- Xuezhi Fang 1
- Shengxiang Gao 1
- Xiaofei Gao 1
- Yan Gao 1
- Peng Han 1
- Yao Hu 1
- Dongqi Huang 1
- Yiming Huang 1
- Zhongtao Jiang 1
- Lei Jin 1
- Fangyu Lei 1
- Xinping Lei 1
- Dong Li 1
- Zhe Li 1
- Qingbin Liu 1
- Xiao Liu 1
- Zheng Liu 1
- Jianwen Luo 1
- Kun Luo 1
- Cunli Mao 1
- Maoyujiao 1
- Tianyi Men 1
- Jinxiang Meng 1
- Hongjin Qian 1
- Shuo Shang 1
- Shenghui Shi 1
- Ran Song 1
- Dianbo Sui 1
- Aixin Sun 1
- Lu Wang 1
- Ziyi Xia 1
- Shitao Xiao 1
- Haotian Xu 1
- Haotian Xu 1
- Xingchang Yang 1
- Zhao Yang 1
- Yiqun Yao 1
- Xuanwu Yin 1
- Yiwu 1
- Bohan Yu 1
- Zhengtao Yu (余正涛) 1
- Hongbang Yuan 1
- Daojian Zeng 1
- Haoxin Zhang 1
- XueYou Zhang 1
- Yuanzhe Zhang 1
- Zhiheng Zhang 1
- Chengfeng Zhao 1
- Yang Zhou 1
- Kejian Zhu 1