Wentao Shi

2026

Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
Wentao Shi | Zichun Yu | Fuli Feng | Xiangnan He | Chenyan Xiong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Model (LLM) based multi-agent systems (MAS) show strong potential for tackling complex tasks through collaborative intelligence. Monte Carlo Tree Search (MCTS) based methods provide promising approaches for enhancing MAS self-training by generating synthetic data, using Q-values to estimate agent contributions. However, relying solely on Q-values may misalign with the goal of selecting data most beneficial for MAS improvement. To address this discrepancy, we propose **D**ata **I**nfluence-oriented **T**ree **S**earch (**DITS**), a novel framework that incorporates influence scores to guide both tree search and data selection in data synthesis. By leveraging influence scores, we effectively identify the most impactful data for MAS improvement, thereby enhancing model performance. Furthermore, we derive a novel influence score estimation method tailored for non-differentiable metrics, significantly reducing computational overhead by calculating performance changes on the validation set. Extensive experiments on three different multi-agent tasks demonstrate the robustness and effectiveness of the proposed methods. Notably, our findings reveal that allocating more resources to estimate influence scores, rather than Q-values, during data synthesis can more effectively and efficiently enhance model training. The code is available at https://anonymous.4open.science/r/DITS-F1C4/.

pdf bib abs

Large language models have seen widespread adoption, yet they remain vulnerable to multi-turn jailbreak attacks, threatening their safe deployment. This has led to the task of training automated multi-turn attackers to probe model safety vulnerabilities. However, existing approaches typically rely on turn-level optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate this task as a multi-turn reinforcement learning problem, directly optimizing the harmfulness of the final-turn response as the outcome reward. To address the sparse supervision of the outcome reward, we introduce TROJail, which employs two process rewards to evaluate the utility of intermediate prompts and integrate them into advantage estimation. These rewards (1) penalize overly harmful prompts that trigger the model’s refusal mechanism, and (2) encourage steering the semantic relevance of responses toward the targeted harmful content. Experimental results show improved attack success rates across multiple models and benchmarks, highlighting the effectiveness of our approach. The code is available at https://anonymous.4open.science/r/TROJail. Warning: This paper contains examples of harmful content.

pdf bib abs

As reinforcement learning continues to scale the training of large language model–based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored.We introduce a benchmark AJ-Bench to systematically evaluate Agent-as-a-Judge across three domains—search, data systems, and graphical user interfaces—comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents’ abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.

2025

pdf bib abs

Most recommender systems focus on short-term objectives such as click-through rate, often at the expense of long-term user satisfaction. This can lead to echo chambers, where users are repeatedly exposed to redundant content. While recent efforts integrate Large Language Models (LLMs) into recommendation, they typically inherit this short-sighted focus. In this work, we highlight unpaired feedback—implicit signals such as continued engagement (positive) or silent disengagement (negative) that lack explicit contrastive labels—as a key challenge for long-term recommendation. Effectively learning from such feedback is crucial for improving LLM-based recommenders in dynamic user environments. To this end, we propose ULRec (Unpaired Feedback for Long-Term LLM-based Recommendation Tuning), a simple framework that fine-tunes LLMs using both positive and negative unpaired feedback. ULRec leverages the KTO algorithm to incorporate these signals without requiring paired supervision. Despite its simplicity, ULRec consistently improves long-term recommendation performance, demonstrating the value of modeling unpaired user feedback.

pdf bib abs

To adapt large language models (LLMs) to ranking tasks, existing list-wise methods, represented by list-wise Direct Preference Optimization (DPO), focus on optimizing partial-order or full-order list ranking consistency for LLMs to enhance their ranking abilities.However, we argue that optimizing top-K ranking consistency could be more appropriate for real-world applications. There are two main reasons: (1) users are typically concerned with only the top-K results, making top-K ranking more important, and (2) tail items often lack precise feedback, making top-K ranking more reliable. Based on this, we propose K-order Ranking Preference Optimization (KPO) by extending the DPO’s Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing that the number of important items can vary across queries, we extend KPO to dynamically determine appropriate K for different samples and introduce a curriculum learning strategy to boost training efficiency. Extensive experiments demonstrate the effectiveness of KPO, highlighting its high sample efficiency and robustness to noise. The code is available at https://github.com/Lanyu0303/KPO.

pdf bib abs

Multi-Objective Alignment (MOA) aims to align LLMs’ responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines

2024

pdf bib abs

Direct Multi-Turn Preference Optimization for Language Agents
Wentao Shi | Mengqi Yuan | Junkang Wu | Qifan Wang | Fuli Feng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.