Jiajun Sun


2026

Language agents, i.e., LLM agents, progress rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work address this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduce a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves R² > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.

2025

Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model’s over-susceptibility to image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable adversarial training method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.