Shichun Liu
2026
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
Deming Ding | Shichun Liu | Enhui Yang | Jiahang Lin | Ziying Chen | Shihan Dou | Honglin Guo | Weiyu Cheng | Pengyu Zhao | Chengjun Xiao | Qunhong Zeng | Qi Zhang | Xuanjing Huang | Qidi Xu | Tao Gui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We will release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.
DARM: Distribution-Aware Reward Modeling by Alleviating Biases from Low Preference-Context Dependency Data
Shaofan Liu | Guoqiang Zhang | Shihan Dou | Huiyuan Zheng | Yiming Zhou | Junjie Ye | Shaowen Wang | Shichun Liu | Jiazheng Zhang | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reward models (RMs) are the surrogate objectives in reinforcement learning from human feedback (RLHF), and their scores directly steer policy optimization. We show that standard RM training is vulnerable in data subsets where response quality depends only weakly on the context: such instances encourage the RM to ignore the context, leading to context neglect and degraded accuracy. To address this failure mode, we propose Distribution-Aware Reward Modeling (DARM), which augments the RM objective with a conditional mutual information regularizer that maximizes the mutual information between the context and the predicted reward, conditioned on the response. By explicitly preserving the sensitivity of reward signals to the prompting context, DARM reduces over-reliance on response-only features and improves robustness to contextual variation. Extensive experiments across in-distribution and out-of-distribution settings show that DARM-trained RMs deliver more accurate and consistent scoring than strong baselines. We further evaluate its downstream impact in RLHF, where DARM produces better-aligned policies. We also demonstrate the necessity of each DARM design component and the impact of key parameters on performance through ablation experiments.
2025
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Ming Zhang | Yujiong Shen | Zelin Li | Huayu Sha | Binze Hu | Yuhui Wang | Chenhao Huang | Shichun Liu | Jingqi Tong | Changhao Jiang | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks fall into three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.
Lost in the Context: Insufficient and Distracted Attention to Contexts in Preference Modeling
Shihan Dou | Jiayi Chen | Chenhao Huang | Feng Chen | Wei Chengzhi | Huiyuan Zheng | Shichun Liu | Yan Liu | Chenxiao Liu | Chao Xin | Lin Yan | Zongzhang Zhang | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In Reinforcement Learning from Human Feedback (RLHF), the reward model (RM) evaluates the response quality based on the given context and assigns a reward. It plays a crucial role in aligning RLHF with human preferences. Although the current RM training paradigm concatenates the context and response while amplifying the reward difference between good and bad response pairs, we demonstrate that the RM faces two significant issues: i) it often allocates only a small proportion of attention to the context, and ii) it frequently ignores segments of the context that are relevant for evaluating the response quality. These issues undermine the RM’s effectiveness in modeling human preferences. To further address these challenges, we propose AttnRM, a novel optimization framework that enables the RM to concentrate on crucial segments of the context. Experimental results demonstrate that AttnRM significantly improves preference modeling by increasing attention to relevant information within the context. It also enhances the RM’s generalizability and achieves better performance in aligning with human preferences.
Multi-Programming Language Sandbox for LLMs
Shihan Dou | Jiazheng Zhang | Jianxiang Zang | Yunbo Tao | Weikang Zhou | Haoxiang Jia | Shichun Liu | Yuming Yang | Shenxi Wu | Zhiheng Xi | Muling Wu | Rui Zheng | Changze Lv | Limao Xiong | Shaoqing Zhang | Lin Zhang | Wenyu Zhan | Rongxiang Weng | Jingang Wang | Xunliang Cai | Yueming Wu | Ming Wen | Yixin Cao | Tao Gui | Xipeng Qiu | Qi Zhang | Xuanjing Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of the code, then compile and execute it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. It can also be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of generated code, and it helps researchers streamline their workflows for various LLM-based code-related tasks, reducing the development cost. To validate the effectiveness of MPLSandbox, we conduct extensive experiments by integrating it into several training and deployment scenarios, and employing it to optimize workflows for a wide range of downstream code tasks. Our goal is to enhance researcher productivity on LLM-based code tasks by simplifying and automating workflows through delegation to MPLSandbox.
2024
Self-Demos: Eliciting Out-of-Demonstration Generalizability in Large Language Models
Wei He | Shichun Liu | Jun Zhao | Yiwen Ding | Yi Lu | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: NAACL 2024
Large language models (LLMs) have shown promising abilities of in-context learning (ICL), adapting swiftly to new tasks with only few-shot demonstrations. However, current few-shot methods heavily depend on high-quality, query-specific demos, which are often lacking. When faced with out-of-demonstration (OOD) queries, methods that rely on hand-crafted demos or external retrievers might fail. To bridge the gap between limited demos and OOD queries, we propose Self-Demos, a novel prompting method that elicits the inherent generalizability in LLMs by query-aware demo generation. The generated demos strategically interpolate between existing demos and the given query, transforming the query from OOD to ID. To evaluate the effectiveness of our approach, we manually constructed OOD-Toolset, a dataset in the tool-using scenario with over 300 real-world APIs and 1000 instances, each consisting of three tool-use cases as demos and an OOD query. Thorough experiments on our dataset and two public math benchmarks have shown that our method can outperform state-of-the-art baselines in the OOD setting. Moreover, we conduct a range of analyses to validate Self-Demos’s generalization and provide more insights.
TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities
Ming Zhang | Caishuang Huang | Yilong Wu | Shichun Liu | Huiyuan Zheng | Yurui Dong | Yujiong Shen | Shihan Dou | Jun Zhao | Junjie Ye | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Task-oriented dialogue (TOD) systems aim to efficiently handle task-oriented conversations, including information collection. How to utilize TOD accurately, efficiently and effectively for information collection has always been a critical and challenging task. Recent studies have demonstrated that Large Language Models (LLMs) excel in dialogue, instruction generation, and reasoning, and can significantly enhance the performance of TOD through fine-tuning. However, current datasets primarily cater to user-led systems and are limited to predefined specific scenarios and slots, thereby necessitating improvements in the proactiveness, diversity, and capabilities of TOD. In this study, we present a detailed multi-domain task-oriented data construction process for conversations, and a Chinese dialogue dataset generated based on this process, **TransferTOD**, which authentically simulates human-computer dialogues in 30 popular life service scenarios. Leveraging this dataset, we trained a model called **TransferTOD-7B** using full-parameter fine-tuning, showcasing notable abilities in slot filling and questioning. Our work has demonstrated its strong generalization capabilities in various downstream scenarios, significantly enhancing both data utilization efficiency and system performance. The data is released at https://github.com/KongLongGeFDU/TransferTOD.
Co-authors
- Tao Gui 8
- Xuan-Jing Huang (黄萱菁) 8
- Shihan Dou 7
- Qi Zhang 5
- Zhiheng Xi 4
- Yujiong Shen 3
- Ming Zhang 3
- Qi Zhang 3
- Huiyuan Zheng 3
- Mingxu Chai 2
- Chenhao Huang 2
- Changhao Jiang 2
- Huayu Sha 2
- Jingqi Tong 2
- Yilong Wu 2
- Junjie Ye (叶俊杰) 2
- Jiazheng Zhang 2
- Jun Zhao 2
- Xunliang Cai 1
- Yixin Cao 1
- Feng Chen 1
- Jiayi Chen 1
- Ziying Chen 1
- Weiyu Cheng 1
- Wei Chengzhi 1
- Jingyi Deng 1
- Deming Ding 1
- Yiwen Ding 1
- Yurui Dong 1
- Honglin Guo 1
- Wei He 1
- Binze Hu 1
- Caishuang Huang 1
- Yueyuan Huang 1
- Haoxiang Jia 1
- Zelin Li 1
- Jiahang Lin 1
- Chenxiao Liu 1
- Shaofan Liu 1
- Yan Liu 1
- Yi Lu 1
- Changze Lv 1
- Qiyuan Peng 1
- Xipeng Qiu (邱锡鹏) 1
- Kexin Tan 1
- Yunbo Tao 1
- Jingang Wang 1
- Junzhe Wang 1
- Shaowen Wang 1
- Yuhui Wang 1
- Yuhui Wang 1
- Ming Wen 1
- Rongxiang Weng 1
- Mingqi Wu 1
- Muling Wu 1
- Shenxi Wu 1
- Yueming Wu 1
- Chengjun Xiao 1
- Chao Xin 1
- Limao Xiong 1
- Qidi Xu 1
- Lin Yan 1
- Enhui Yang 1
- Yuming Yang 1
- Jianxiang Zang 1
- Qunhong Zeng 1
- Wenyu Zhan 1
- Guoqiang Zhang 1
- Lin Zhang 1
- Shaoqing Zhang 1
- Yue Zhang 1
- Zhihao Zhang 1
- Zongzhang Zhang 1
- Pengyu Zhao 1
- Rui Zheng 1
- Weikang Zhou 1
- Yiming Zhou 1