Yilong Wu
2026
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. An 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
Junjie Ye | Caishuang Huang | Zhuohan Chen | Wenjie Fu | Chenyuan Yang | Leyi Yang | Yilong Wu | Peng Wang | Meng Zhou | Xiaolong Yang | Tao Gui | Qi Zhang | Zhongchao Shi | Jianping Fan | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Junjie Ye | Caishuang Huang | Zhuohan Chen | Wenjie Fu | Chenyuan Yang | Leyi Yang | Yilong Wu | Peng Wang | Meng Zhou | Xiaolong Yang | Tao Gui | Qi Zhang | Zhongchao Shi | Jianping Fan | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Instruction following refers to the ability of large language models (LLMs) to generate outputs that satisfy all specified constraints. Existing research has primarily focused on constraint categories, offering limited evaluation dimensions and little guidance for improving instruction-following abilities. To address this gap, we introduce MulDimIF, a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Based on this framework, we design a controllable instruction generation pipeline. Through constraint expansion, conflict detection, and instruction rewriting, we construct 9,106 code-verifiable samples. We evaluate 18 LLMs from six model families and find marked performance differences across constraint settings. For instance, average accuracy decreases from 80.82% at Level I to 36.76% at Level IV. Moreover, training with data generated by our framework significantly improves instruction following without compromising general performance. In-depth analysis indicates that these gains stem largely from parameter updates in attention modules, which strengthen constraint recognition and adherence. Code and data are available in https://github.com/Junjie-Ye/MulDimIF.
2025
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Junjie Ye | Guanyu Li | SongYang Gao | Caishuang Huang | Yilong Wu | Sixian Li | Xiaoran Fan | Shihan Dou | Tao Ji | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 31st International Conference on Computational Linguistics
Junjie Ye | Guanyu Li | SongYang Gao | Caishuang Huang | Yilong Wu | Sixian Li | Xiaoran Fan | Shihan Dou | Tao Ji | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 31st International Conference on Computational Linguistics
Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs’ tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
Ming Zhang | Yuhui Wang | Yujiong Shen | Tingyi Yang | Changhao Jiang | Yilong Wu | Shihan Dou | Qinhao Chen | Zhiheng Xi | Zhihao Zhang | Yi Dong | Zhen Wang | Zhihui Fei | Mingyang Wan | Tao Liang | Guojun Ma | Qi Zhang | Tao Gui | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Ming Zhang | Yuhui Wang | Yujiong Shen | Tingyi Yang | Changhao Jiang | Yilong Wu | Shihan Dou | Qinhao Chen | Zhiheng Xi | Zhihao Zhang | Yi Dong | Zhen Wang | Zhihui Fei | Mingyang Wan | Tao Liang | Guojun Ma | Qi Zhang | Tao Gui | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on PlantUML specification, each UML flowchart is converted into atomic dialogue units i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples, and a 0.5B model trained on total data both can surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o up to 43.88% with an average of 11.00%. We further evaluate models’ performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released in https://github.com/KongLongGeFDU/PFDial.
TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use
Junjie Ye | Yilong Wu | Sixian Li | Yuming Yang | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang | Peng Wang | Zhongchao Shi | Jianping Fan | Zhengyin Du
Findings of the Association for Computational Linguistics: EMNLP 2025
Junjie Ye | Yilong Wu | Sixian Li | Yuming Yang | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang | Peng Wang | Zhongchao Shi | Jianping Fan | Zhengyin Du
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments, a critical step toward generalized AI. However, the standard supervised fine-tuning (SFT) approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use, leading to performance bottlenecks. To address this issue, we analyze three existing LLMs and uncover key insights: training data can inadvertently impede tool-use behavior, token importance is distributed unevenly, and errors in tool calls fall into a small set of categories. Building on these findings, we propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data, dynamically adjusts token weights to prioritize key tokens during SFT, and incorporates a robust reward mechanism tailored to error categories, optimized through proximal policy optimization. We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four open-source test sets. Our results demonstrate that the LLM trained by our method matches or surpasses both open- and closed-source LLMs in tool-use performance using only 1,217 training data points. Additionally, our method enhances robustness in noisy environments and improves general task performance, offering a scalable and efficient paradigm for tool-use training in LLMs. Code and data are available at https://github.com/Junjie-Ye/TL-Training.
2024
TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities
Ming Zhang | Caishuang Huang | Yilong Wu | Shichun Liu | Huiyuan Zheng | Yurui Dong | Yujiong Shen | Shihan Dou | Jun Zhao | Junjie Ye | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Ming Zhang | Caishuang Huang | Yilong Wu | Shichun Liu | Huiyuan Zheng | Yurui Dong | Yujiong Shen | Shihan Dou | Jun Zhao | Junjie Ye | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Task-oriented dialogue (TOD) systems aim to efficiently handle task-oriented conversations, including information collection. How to utilize TOD accurately, efficiently and effectively for information collection has always been a critical and challenging task. Recent studies have demonstrated that Large Language Models (LLMs) excel in dialogue, instruction generation, and reasoning, and can significantly enhance the performance of TOD through fine-tuning. However, current datasets primarily cater to user-led systems and are limited to predefined specific scenarios and slots, thereby necessitating improvements in the proactiveness, diversity, and capabilities of TOD. In this study, we present a detailed multi-domain task-oriented data construction process for conversations, and a Chinese dialogue dataset generated based on this process, **TransferTOD**, which authentically simulates human-computer dialogues in 30 popular life service scenarios. Leveraging this dataset, we trained a model using full-parameter fine-tuning called **TransferTOD-7B**, showcasing notable abilities in slot filling and questioning. Our work has demonstrated its strong generalization capabilities in various downstream scenarios, significantly enhancing both data utilization efficiency and system performance. The data is released in https://github.com/KongLongGeFDU/TransferTOD.
ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
Junjie Ye | Sixian Li | Guanyu Li | Caishuang Huang | Songyang Gao | Yilong Wu | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Junjie Ye | Sixian Li | Guanyu Li | Caishuang Huang | Songyang Gao | Yilong Wu | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool learning is widely acknowledged as a foundational approach or deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present ToolSword, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing malicious queries and jailbreak attacks in the input stage, noisy misdirection and risky cues in the execution stage, and harmful feedback and error conflicts in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, which even GPT-4 is susceptible to. Moreover, we conduct further studies with the aim of fostering research on tool learning safety. The data will be released upon acceptance of the paper.
RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
Junjie Ye | Yilong Wu | Songyang Gao | Caishuang Huang | Sixian Li | Guanyu Li | Xiaoran Fan | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Junjie Ye | Yilong Wu | Songyang Gao | Caishuang Huang | Sixian Li | Guanyu Li | Xiaoran Fan | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs’ capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce *RoTBench*, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model’s resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. For instance, the performance of GPT-4 even drops significantly from 80.00 to 58.10 when there is no substantial change in manual accuracy. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.
Search
Fix author
Co-authors
- Tao Gui 8
- Xuan-Jing Huang (黄萱菁) 8
- Junjie Ye (叶俊杰) 6
- Qi Zhang 6
- Caishuang Huang 5
- Shihan Dou 4
- Sixian Li 4
- Songyang Gao 3
- Guanyu Li 3
- Yujiong Shen 3
- Zhiheng Xi 3
- Ming Zhang 3
- Xiaoran Fan 2
- Jianping Fan 2
- Changhao Jiang 2
- Shichun Liu 2
- Zhongchao Shi 2
- Peng Wang 2
- Qi Zhang 2
- Mingxu Chai 1
- Qinhao Chen 1
- Zhuohan Chen 1
- Jingyi Deng 1
- Yi Dong 1
- Yurui Dong 1
- Zhengyin Du 1
- Zhihui Fei 1
- Wenjie Fu 1
- Yueyuan Huang 1
- Tao Ji 1
- Tao Liang 1
- Guojun Ma 1
- Qiyuan Peng 1
- Huayu Sha 1
- Kexin Tan 1
- Jingqi Tong 1
- Mingyang Wan 1
- Yuhui Wang 1
- Junzhe Wang 1
- Yuhui Wang 1
- Zhen Wang 1
- Mingqi Wu 1
- Tingyi Yang 1
- Chenyuan Yang 1
- Leyi Yang 1
- Xiaolong Yang 1
- Yuming Yang 1
- Yue Zhang 1
- Zhihao Zhang 1
- Zhihao Zhang 1
- Jun Zhao 1
- Huiyuan Zheng 1
- Meng Zhou 1