Yuchen Wu

2026

Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing
Yuchen Wu | Liang Ding | Li Shen | Dacheng Tao
Findings of the Association for Computational Linguistics: ACL 2026

Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a "faithfulness gap": they optimize for format mimicry rather than sound reasoning. This gap enables the LLM’s powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning "Houston" from "NASA" despite an explicit edit). To solve this core LLM alignment problem, we propose **Reason-KE++**, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets **a new SOTA of 95.48%** on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.

pdf bib abs

Recent advancements in Large Language Models (LLMs) have revolutionized Text-to-SQL parsing, achieving remarkable success in static, single-turn query generation. However, a significant disparity remains between these academic benchmarks and real-world utility. In practical applications, such as financial auditing or business analytics, user intents are rarely static; they evolve dynamically through iterative refinement, necessitating not just information retrieval (SELECT) but continuous state manipulation (INSERT, UPDATE, DELETE). To bridge this gap, we introduce DySQL-Bench, a novel benchmark designed to rigorously evaluate LLMs within a dynamic interaction framework. Unlike varying manual curation efforts, DySQL-Bench employs a two-stage automated synthesis pipeline: transforming raw relational schemas into hierarchical logic trees to generate user-database interactions, followed by a rigorous verify-and-refine protocol that ensures 100% distinct correctness via human expert validation. We further propose an interactive evaluation environment simulating a triadic workflow involving an LLM-simulated user, the agent under test, and an executable database system. Spanning 13 diverse domains with 1,072 complex tasks, our experiments reveal that current powerful models struggle in this realistic setting. Notably, GPT-4o achieves only 58.34% overall accuracy and a meager 23.81% on the strict Pass^5 metric, highlighting the substantial challenges DySQL-Bench poses for the future of database agents.

Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase—the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models’ ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 509K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation.Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.

pdf bib abs

The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration. Our code is available at https://anonymous.4open.science/r/diffusion_agent-953C.

2025

pdf bib abs

Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs
Yuchen Wu | Liang Ding | Li Shen | Dacheng Tao
Findings of the Association for Computational Linguistics: ACL 2025

Knowledge editing allows for efficient adaptation of large language models (LLMs) to new information or corrections without requiring full retraining. However, prior methods typically focus on either single-language editing or basic multilingual editing, failing to achieve true cross-linguistic knowledge synchronization. To address this, we present a simple and practical state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), designed to propagate knowledge from a dominant language to other languages effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel dataset to modify in-scope knowledge while preserving unrelated information, and (ii) Target-language Preference Optimization (TL-PO), which applies advanced optimization techniques to ensure consistency across languages, fostering the transfer of updates. Additionally, we contribute a high-quality, cross-lingual dataset, specifically designed to enhance knowledge transfer across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks show that X-KDE significantly enhances cross-lingual performance, achieving an average improvement of +8.19%, while maintaining high accuracy in monolingual settings.

pdf bib abs

Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA
Yuchen Wu | Liang Ding | Li Shen | Dacheng Tao
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) encode vast amounts of world knowledge but remain static once trained, making timely integration of emerging facts prohibitively expensive via full retraining. Knowledge-editing techniques have thus emerged to inject or overwrite specific facts into LLMs, yet they either over-rely on superficial cues or incur complex, iterative pipelines that collapse under noisy, multi-hop conditions. We introduce **Reason-KE**, an end-to-end reasoning-chain-based editing framework that steers a pretrained LLM through four structured stages—fact acknowledgment, relevance determination, selective application, and final reasoning—to filter distractors in a single pass. Trained on MQuAKE-CF with up to four irrelevant facts, Reason-KE elevates Qwen2.5-7B’s multi-hop QA accuracy to 90.2% (↑17.6 pp) while suffering merely 6.3% drop under heavy distraction and <1% when answers are leaked. Our quantitative analysis confirms Reason-KE’s resilience and efficiency, establishing a new state of the art for reliable LLM knowledge updates. The code will be released.