Chetan Bansal


2026

Large Language Models (LLMs) have demonstrated remarkable capabilities in agentic collaborative problem-solving, albeit a gap exists. Existing frameworks predominantly rely on natural language as a primary representation (format) for agentic communication. However natural language could be ambiguous and verbose. Furthermore, recent works have shown that alternative representations can enhance performance in LLMs on certain tasks. But current approaches lack the intelligence necessary to understand, learn or apply optimal communication representations adaptively. In this paper, we propose to dynamically learn the optimal message representations to enhance agentic performance. We model the optimization problem as an Expanding Markov Decision Process (EMDP) and propose our method named OPTiMACS. We evaluate our system across benchmark datasets of collaborative problem-solving. The results show significant performance improvements while maintaining efficiency. Our work bridges the gap between rigid communication protocols and open-ended natural language by providing an adaptive framework that learns task-aware structural representations.
Web agents struggle to adapt to new websites due to the scarcity of environment specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge, however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories. Our approach begins by synthesizing diverse tasks through categorized exploration of web elements, ensuring efficient coverage of the target environment. During trajectory collection, tasks are refined only when conflicts with observations are detected, which mitigates hallucinations while preserving task consistency. After collection, we conduct trajectory refinement with global context to mitigate potential noise or misalignments. Finally, we fine-tune open-source web agents on the refined synthetic data to adapt them to the target environment. Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision. The code is publicly available at https://github.com/aiming-lab/SynthAgent.

2025

Reward modeling in large language models is known to be susceptible to reward hacking, causing models to latch onto superficial features such as the tendency to generate lists or unnecessarily long responses. In RLHF, and more generally during post-training, flawed reward signals often lead to outputs that optimize for these spurious correlates instead of genuine quality or correctness. We propose **Carmo (Context-Aware Reward Modeling)**, a novel approach that first generates dynamic, context-relevant criteria to ground the reward model prior to producing reward scores. Unlike prior methods that use static rubrics, Carmo leverages powerful LLMs to adaptively create evaluation criteria, e.g., logical consistency, clarity, and depth, tailored to the user query. Our theoretical analysis shows that such criteria generation can mitigate reward hacking. We further demonstrate how Carmo can be distilled into smaller models, thereby lowering the computational cost of alignment. We establish a new state-of-the-art performance on zero shot setting for generative models, with a 2.1% improvement on Reward Bench. Furthermore, alignment performed on the Carmo-curated preference dataset achieves **22.5% and 21.1% LC-WR (%) and WR (%) on Mistral-Base (7B)**. We release our datasets at [huggingface/CARMO](https://huggingface.co/datasets/Multi-preference-Optimization/CARMO-UltraFeedback).
Current Large Language Models excel in general reasoning yet struggle with specialized tasks requiring proprietary or domain-specific knowledge. Fine-tuning large models for every niche application is often infeasible due to black-box constraints and high computational overhead. To address this, we propose a collaborative framework that pairs a specialized weak model with a general strong model. The weak model, tailored to specific domains, produces initial drafts and background information, while the strong model leverages its advanced reasoning to refine these drafts, extending LLMs’ capabilities to critical yet specialized tasks. To optimize this collaboration, we introduce a collaborative feedback to fine-tunes the weak model, which quantifies the influence of the weak model’s contributions in the collaboration procedure and establishes preference pairs to guide preference tuning of the weak model. We validate our framework through experiments on three domains. We find that the collaboration significantly outperforms each model alone by leveraging complementary strengths. Moreover, aligning the weak model with the collaborative preference further enhances overall performance.
Recent Large Language Models (LLMs) have demonstrated satisfying general instruction following ability. However, small LLMs with about 7B parameters still struggle fine-grained format following (e.g., JSON format), which seriously hinder the advancements of their applications. Most existing methods focus on benchmarking general instruction following while overlook how to improve the specific format following ability for small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce the intrinsic bias of LLMs and be costly due to the API calls. In this paper, we first curate a fully verifiable format following dataset VFF. In contrast to existing works often adopting external LLMs for instruction-following validations, every sample of VFF can be easily validated with a Python function. Further, we propose to leverage this verifiable feature to synthesize massive data for progressively training small LLMs, in order to improve their format following abilities. Experimental results highlight the prevalent limitations in the format following capabilities of 7B level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.