Yang Wang



2026

While current LLM agents utilizing paradigms like ReAct or Plan-and-Solve have established a strong foundation for step-by-step reasoning, they remain brittle in open-ended environments due to two intrinsic limitations: (1) A closed action space: These frameworks are confined to static, pre-defined toolsets, rendering them unable to adapt when required tools are missing or obsolete. (2) Myopic error recovery: Existing agents often get trapped in repetitive local retries, failing to diagnose and rectify root causes within the high-level plan. To overcome these limitations, we introduce CAR (Create And Replan), a novel architecture that incorporates a meta-tool synthesizer to dynamically augment the action space and a reflective replanning mechanism to revise global strategies. To rigorously evaluate our approach, we release ToolHop-Pro, a diagnostic benchmark with systematically pruned toolsets to simulate tool scarcity. Experiments demonstrate that CAR significantly outperforms representative baselines, validating its superior robustness where static agents fail. Code and data are available at https://github.com/Zaiz-77/car.
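
The two ideas named in the abstract above, augmenting the action space when a tool is missing and revising the plan instead of retrying the same step, can be illustrated with a minimal sketch. This is not the CAR implementation: `run_plan`, `synthesize_tool`, `replan`, and `max_replans` are hypothetical names chosen for illustration, and the real system presumably delegates both callbacks to an LLM.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Step = Tuple[str, str]  # (tool name, argument); a hypothetical plan format


def run_plan(
    plan: List[Step],
    tools: Dict[str, Tool],
    synthesize_tool: Callable[[str], Tool],
    replan: Callable[[List[Step], Exception], List[Step]],
    max_replans: int = 2,
) -> List[str]:
    """Execute a plan, creating missing tools and replanning on failure."""
    results: List[str] = []
    replans = 0
    i = 0
    while i < len(plan):
        name, arg = plan[i]
        if name not in tools:
            # Open action space: build the missing tool instead of giving up.
            tools[name] = synthesize_tool(name)
        try:
            results.append(tools[name](arg))
            i += 1
        except Exception as err:
            if replans >= max_replans:
                raise
            replans += 1
            # Revise the remaining plan rather than retrying the failed step.
            plan = plan[:i] + replan(plan[i:], err)
    return results
```

Capping the number of replans keeps the loop from cycling forever when neither a new tool nor a new plan resolves the failure.
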
Multi-agent systems (MAS) built on large language models promise improved problem-solving through collaboration, yet they often fail to consistently outperform strong single-agent baselines due to error propagation at inter-agent message handoffs. In this work, we conduct a systematic empirical analysis of such failures and introduce an edge-level error taxonomy that identifies four dominant error types (Data Gap, Signal Corruption, Referential Drift, and Capability Gap) as primary sources of failure in multi-agent interactions. Building on this taxonomy, we propose AgentAsk, a lightweight clarification module designed to intervene at the edge level in MAS to prevent cascading errors. The module applies minimal clarifications at critical points within the system, improving the accuracy and efficiency of the overall task. AgentAsk is trained to balance the trade-offs between clarification cost, latency, and accuracy; it is also architecture-agnostic and can be easily integrated into existing systems. Evaluated across five benchmarks, AgentAsk consistently improves accuracy by up to 4.69% while keeping latency and extra costs below 10% compared to baseline MAS. The code is available at https://anonymous.4open.science/r/AgentAsk-3432.
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards lead to inefficient intra-group advantage estimation during sampling in methods like Group Relative Policy Optimization (GRPO). To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively. Code is available at https://github.com/Flipped-May/TSPO.
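
A minimal sketch of the first-occurrence idea described above, and of why it raises reward variance within a group. It is not the TSPO code: the substring match, the `partial=0.5` value, and the function names are assumptions made for illustration. The group normalization is the usual mean/standard-deviation form associated with GRPO.

```python
from statistics import mean, pstdev


def first_occurrence_turn(turn_texts, answer):
    """Index of the first turn whose text contains the answer, else None."""
    needle = answer.strip().lower()
    for i, text in enumerate(turn_texts):
        if needle in text.lower():
            return i
    return None


def turn_rewards(turn_texts, final_answer, answer, partial=0.5):
    """Per-turn rewards: partial credit where the answer first surfaces,
    plus the outcome reward on the last turn if the final answer is right.
    The 0.5 partial value is an assumed placeholder, not a published setting."""
    rewards = [0.0] * len(turn_texts)
    k = first_occurrence_turn(turn_texts, answer)
    if k is not None:
        rewards[k] += partial
    if rewards and final_answer.strip().lower() == answer.strip().lower():
        rewards[-1] += 1.0
    return rewards


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts that both answer wrongly; only the first retrieved the answer.
rollout_a = ["search: capital of France", "doc: Paris is the capital", "answer: Lyon"]
rollout_b = ["search: French cities", "doc: Lyon is a large city", "answer: Lyon"]

outcome_only = [0.0, 0.0]  # identical rewards give zero advantages
with_partial = [
    sum(turn_rewards(rollout_a, "Lyon", "Paris")),
    sum(turn_rewards(rollout_b, "Lyon", "Paris")),
]
print(group_advantages(outcome_only))
print(group_advantages(with_partial))
```

With outcome-only rewards, both failed rollouts are indistinguishable and contribute no learning signal; the partial credit separates the rollout that at least retrieved the answer.
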
Large language models (LLMs) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations, such as a marginalized role in model architectures, reliance on coarse statistical text prompts, and a lack of interpretability. In this work, we introduce Augur, a fully LLM-driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two-stage teacher-student architecture in which a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and is fine-tuned on high-confidence causal associations, encoded as rich textual prompts, to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 25 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.
While Retrieval-Augmented Generation (RAG) has become a standard paradigm for mitigating hallucinations in Large Language Models (LLMs), its effectiveness in complex medical reasoning remains limited. Existing RAG methods suffer from two main challenges: First, **Semantic Drift**: without explicit domain constraints, LLM-driven query decomposition often deviates from the original clinical intent, introducing substantial noise that degrades retrieval relevance. Second, **Concatenation Fallacy**: retrieved evidence from different semantic aspects is aggregated in a naive, unstructured manner, without modeling their inter-dependencies and potential conflicts, which ultimately undermines downstream reasoning. To address these challenges, we propose **Med-SRAF**, a multi-agent retrieval augmentation framework guided by medical domain knowledge. This framework reconstructs the traditional RAG process through two core mechanisms: (1) Intent-driven Semantic Routing, where a UMLS-based NavigationAgent dynamically maps queries to medical dimensions for strategic search space pruning; and (2) Evidence-based Agentic Fusion, where a FusionAgent resolves conflicts among dimension-specific evidence to build logically consistent reasoning chains. Extensive experiments on five widely used medical benchmarks show that Med-SRAF consistently outperforms existing general RAG baselines, achieving an average accuracy improvement of over **4.9%**, highlighting its effectiveness in robust and interpretable medical reasoning. Our code is at https://anonymous.4open.science/r/MultiAgent_RAG-F6DC.

2025

Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors has raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employs topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees.
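
The detect-then-intervene pattern in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not G-Safeguard itself: the per-agent `anomaly_scores` stand in for the output of the graph neural network detector, and cutting the outgoing edges of flagged agents is one assumed form of topological intervention. The function name and the `threshold` default are hypothetical.

```python
def prune_flagged_edges(edges, anomaly_scores, threshold=0.5):
    """Remove outgoing edges of agents whose anomaly score meets the threshold.

    edges: list of (sender, receiver) pairs in the utterance graph.
    anomaly_scores: dict mapping agent id to a score in [0, 1]
        (a stand-in for a learned detector's output).
    Returns the surviving edges and the set of flagged agents.
    """
    flagged = {agent for agent, score in anomaly_scores.items() if score >= threshold}
    kept = [(src, dst) for src, dst in edges if src not in flagged]
    return kept, flagged


edges = [("a", "b"), ("b", "c"), ("c", "a")]
scores = {"a": 0.1, "b": 0.9, "c": 0.2}
kept, flagged = prune_flagged_edges(edges, scores)
print(kept, flagged)
```

Flagged agents can still receive messages in this sketch; only what they send is blocked, so a compromised agent cannot pass its output downstream.
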
Large language models (LLMs) have fueled significant progress in intelligent Multi-agent Systems (MAS), with expanding academic and industrial applications. However, safeguarding these systems from malicious queries has received relatively little attention, and methods for single-agent safety are challenging to transfer. In this paper, we explore MAS safety from a topological perspective, aiming to identify structural properties that enhance security. To this end, we propose the NetSafe framework, which unifies diverse MAS workflows via iterative RelCom interactions to enable generalized analysis. We identify several critical phenomena for MAS under attacks (misinformation, bias, and harmful content), termed Agent Hallucination, Aggregation Safety, and Security Bottleneck. Furthermore, we verify that highly connected and larger systems are more vulnerable to adversarial spread, with task performance in a Star Graph Topology decreasing by 29.7%. In conclusion, our work introduces a new perspective on MAS safety and discovers unreported phenomena, offering insights and posing challenges to the community.