Jiajun Sun
2026
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Changhao Jiang | Ming Zhang | Yifei Cao | Junjie Ye | Xiaoran Fan | Shihan Dou | Zhiheng Xi | Jiajun Sun | Yi Dong | Yujiong Shen | Jingqi Tong | Baoyu Fan | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves R² > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li | Jiajun Sun | Guodong Zheng | Xiaoran Fan | Yujiong Shen | Yi Lu | Zhiheng Xi | Yuming Yang | Wenming Tan | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model’s over-susceptibility to image frequency features when detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable adversarial training method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
Co-authors
- Tao Gui 3
- Xuan-Jing Huang (黄萱菁) 3
- Zhiheng Xi 3
- Xiaoran Fan 2
- Yujiong Shen 2
- Yuming Yang 2
- Junjie Ye (叶俊杰) 2
- Qi Zhang 2
- Yifei Cao 1
- Tinggang Chen 1
- Yi Dong 1
- Shihan Dou 1
- Baoyu Fan 1
- Minghe Gao 1
- Honglin Guo 1
- Xin Guo 1
- Baodai Huang 1
- Jixuan Huang 1
- Jiaming Ji 1
- Tao Ji 1
- Changhao Jiang 1
- Guohao Li 1
- Shuo Li 1
- Chenyu Liu 1
- Dongrui Liu 1
- Jiaqi Liu 1
- Yi Lu 1
- Zhonghang Lu 1
- Wenming Tan 1
- Jingqi Tong 1
- Junzhe Wang 1
- Dingwen Yang 1
- Jiazheng Zhang 1
- Ming Zhang 1
- Qi Zhang 1
- Qi Zhang 1
- Zhihao Zhang 1
- Guodong Zheng 1
- Dingwei Zhu 1