Jiajun Sun
2026
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control
Changhao Jiang | Jiahao Chen | Zhenghao Xiang | Zhixiong Yang | Hanchen Wang | Jiabao Zhuang | Xinmeng Che | Jiajun Sun | Hui Li | Yifei Cao | Shihan Dou | Ming Zhang | Junjie Ye | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open-source system for long-form song generation with fine-grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy-to-deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text–music style similarity, and audio aesthetic quality, while enabling controllable segment-level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long-form song generation research.
From Scores to Preferences: Redefining Evaluation Paradigm for Speech Quality Reward Modeling
Yifei Cao | Changhao Jiang | Jiabao Zhuang | Jiajun Sun | Ming Zhang | Zhiheng Xi | Hui Li | Shihan Dou | Yuran Wang | Yunke Zhang | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Speech quality assessment (SQA) is typically formulated as a score regression task based on subjective ratings, such as the Mean Opinion Score (MOS), which inherently suffer from inconsistent standards and limit cross-dataset training and evaluation. To address these limitations, we reformulate SQA as a preference-based comparison paradigm and construct MOS-Pref, a large-scale MOS-derived preference dataset. Building on MOS-Pref, we systematically implement and evaluate three reward modeling paradigms: scalar, semi-scalar, and generative reward models, alongside existing SQA approaches. Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) score regression-based approaches generally underperform preference-based methods in both overall performance and generalization; and (3) all reward models struggle on pairs with a very small MOS gap. Motivated by these observations, we propose a MOS-aware GRM design that incorporates the MOS gap into the reward function during reinforcement learning. Experimental results show that the MOS-aware GRM significantly improves fine-grained speech quality discrimination. We hope this work fosters more rigorous and scalable research in SQA.
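As a rough illustration of the preference-based formulation described in this abstract, the sketch below turns per-clip MOS ratings into preference pairs and scores a model by pairwise accuracy. It is not the authors' MOS-Pref pipeline: the function names, the `min_gap` tie threshold, and the assumption that all clips come from a single dataset with one rating standard are hypothetical choices made for this example.

```python
from itertools import combinations


def mos_to_preference_pairs(ratings, min_gap=0.0):
    """Build (preferred, rejected, gap) triples from a {clip_id: MOS} mapping.

    Hypothetical helper: pairs whose MOS gap is <= min_gap are treated as
    ties and skipped, since they carry no clear preference.
    """
    pairs = []
    for (a, mos_a), (b, mos_b) in combinations(sorted(ratings.items()), 2):
        gap = abs(mos_a - mos_b)
        if gap <= min_gap:
            continue
        preferred, rejected = (a, b) if mos_a > mos_b else (b, a)
        pairs.append((preferred, rejected, round(gap, 3)))
    return pairs


def pairwise_accuracy(pairs, score_fn):
    """Fraction of pairs where score_fn ranks the preferred clip higher."""
    if not pairs:
        return 0.0
    correct = sum(score_fn(p) > score_fn(r) for p, r, _ in pairs)
    return correct / len(pairs)


# Toy ratings, invented for illustration only.
ratings = {"a": 4.5, "b": 3.0, "c": 3.0}
pairs = mos_to_preference_pairs(ratings)
accuracy = pairwise_accuracy(pairs, lambda clip: ratings[clip])
```

In this toy case the tied clips `b` and `c` produce no pair, and scoring clips by their own MOS trivially ranks every remaining pair correctly. A real reward model would replace the `lambda` with its predicted quality score.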
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Changhao Jiang | Ming Zhang | Yifei Cao | Junjie Ye | Xiaoran Fan | Shihan Dou | Zhiheng Xi | Jiajun Sun | Yi Dong | Yujiong Shen | Jingqi Tong | Baoyu Fan | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. SMI is validated through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines and achieves R² > 0.7 in predicting QA accuracy for models above 1B parameters, without additional training. The analysis further reveals diminishing returns from scaling data and model size and provides evidence for an intrinsic upper bound on knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li | Jiajun Sun | Guodong Zheng | Xiaoran Fan | Yujiong Shen | Yi Lu | Zhiheng Xi | Yuming Yang | Wenming Tan | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model’s over-susceptibility to image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable adversarial training method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
Co-authors
- Tao Gui 5
- Xuan-Jing Huang (黄萱菁) 5
- Zhiheng Xi 4
- Qi Zhang 4
- Yifei Cao 3
- Shihan Dou 3
- Tao Ji 3
- Changhao Jiang 3
- Junjie Ye (叶俊杰) 3
- Ming Zhang 3
- Xiaoran Fan 2
- Hui Li 2
- Yujiong Shen 2
- Yuming Yang 2
- Jiabao Zhuang 2
- Xinmeng Che 1
- Tinggang Chen 1
- Jiahao Chen 1
- Yi Dong 1
- Baoyu Fan 1
- Minghe Gao 1
- Honglin Guo 1
- Xin Guo 1
- Jixuan Huang 1
- Baodai Huang 1
- Jiaming Ji 1
- Guohao Li 1
- Shuo Li 1
- Jiaqi Liu 1
- Chenyu Liu 1
- Dongrui Liu 1
- Zhonghang Lu 1
- Yi Lu 1
- Wenming Tan 1
- Jingqi Tong 1
- Junzhe Wang 1
- Hanchen Wang 1
- Yuran Wang 1
- Zhenghao Xiang 1
- Dingwen Yang 1
- Zhixiong Yang 1
- Qi Zhang 1
- Jiazheng Zhang 1
- Zhihao Zhang 1
- Yunke Zhang 1
- Qi Zhang 1
- Guodong Zheng 1
- Dingwei Zhu 1