Wenxuan Wang - ACL Anthology

Wenxuan Wang

2026

Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation
Jen-tse Huang | Chang Chen | Shiyang Lai | Wenxuan Wang | Michelle R Kaufman | Mark Dredze
Findings of the Association for Computational Linguistics: ACL 2026

Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns—experimental errors, logical fallacies, and fabricated claims—each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.

Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment in which an LLM acts as a dictator, distributing tasks to heterogeneous recipients with different returns on investment (ROI). The benchmark is designed to create a dilemma between maximizing collective efficiency (i.e., overall ROI) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs. Our findings reveal several key insights: (i) LLMs’ general ability, as measured by popular Arena leaderboards, misaligns with their allocation skills; (ii) most LLMs exhibit a strong default utilitarian orientation, prioritizing overall productivity at the cost of greater inequality; and (iii) allocation behaviors are highly manipulable, easily perturbed by common persuasion strategies. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and alignment for AI governance.
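
The efficiency-versus-fairness trade-off described in this abstract can be illustrated with a minimal sketch. It assumes the Gini coefficient is computed over the units allocated to each recipient and that efficiency is the ROI-weighted sum of allocations; the abstract does not spell out either definition, and the function names `gini` and `total_return` are hypothetical.

```python
from typing import Sequence


def gini(allocations: Sequence[float]) -> float:
    """Gini coefficient of a non-negative allocation (0 means perfectly equal)."""
    values = sorted(allocations)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-based formula on ascending values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def total_return(allocations: Sequence[float], roi: Sequence[float]) -> float:
    """Assumed efficiency measure: allocated units weighted by each recipient's ROI."""
    return sum(a * r for a, r in zip(allocations, roi))
```

Under these assumptions, giving every unit to the highest-ROI recipient maximizes `total_return` but pushes `gini` to its upper bound of (n - 1) / n, while an equal split yields a Gini of 0.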

A Survey of Large Models in Sports
Yichen Xu | Jianzhe Ma | Chuhan Wang | Zhonghao Cao | Liangyu Chen | Wenxuan Wang | Qin Jin
Findings of the Association for Computational Linguistics: ACL 2026

Sports have witnessed growing global enthusiasm in recent years, serving as a vital force for physical health, cultural exchange, social connection, and economic growth. The rapid advancement of large models, particularly (multimodal) large language models (M)LLMs, has demonstrated transformative potential to reshape sports understanding, analysis, and interaction across diverse domains. This paper presents a comprehensive survey of large models in sports, including (i) an overview of tasks and applications across different participant groups; (ii) a detailed analysis of sports-related datasets and benchmarks; and (iii) a critical discussion of current challenges and future directions. Our goal is to establish a foundation for advancing research and practical development of large-model-driven sports intelligence. An open-source GitHub repository is maintained at: https://github.com/Road2Redemption/Awesome_Large_Models_In_Sports1.

Identifying the Achilles’ Heel: An Iterative Method for Uncovering Factual Errors in Large Language Models
Wenxuan Wang | Yuk-Kit Chan | Zixuan Ling | Shi Juluan | Youliang Yuan | Jen-tse Huang | Yifei Zhang | Wenxiang Jiao | Zhaopeng Tu | Michael R. Lyu
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs’ veracity are limited by the need for extensive human labor, test data contamination, or limited scope, hindering efficient and effective exposure of errors. To address these challenges, we propose HalluHunter, a novel, fully automated framework for systematically uncovering factual inaccuracies in LLMs. HalluHunter employs a knowledge-graph-based approach, extracting fact triplets to generate diverse question types for single- and multi-hop reasoning using rule-based Natural Language Processing (NLP) techniques. Its iterative process starts with random triplet selection for question generation, followed by adaptive selection in subsequent iterations, targeting triplets where LLMs frequently err based on their performance analysis. Our extensive tests on nine prominent LLMs reveal that HalluHunter can trigger factual errors in up to 55% of questions in these models. Moreover, we demonstrate that HalluHunter’s test cases, particularly in adaptive selection, could further expose the weaknesses in benchmarking the factuality of LLMs while maintaining the coverage of questions. All code, data, and results will be released for future research.

Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
Yuxuan Wan | Tianqing Fang | Zaitang LI | Yintong Huo | Wenxuan Wang | Haitao Mi | Dong Yu | Michael R. Lyu
Findings of the Association for Computational Linguistics: ACL 2026

Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: test-time self-evolving the agent’s ability by iteratively verifying the policy model’s outputs, guided by meticulously crafted rubrics. This approach gives rise to an inference-time scaling of verification, wherein an agent self-improves at test time by evaluating its generated answers to produce iterative feedback and refinements without any additional training. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%–48% in meta-evaluation F1 score. To enable practical test-time self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%–11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.

AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
Shu Yang | Jingyu Hu | Tong Li | Hanqi Yan | Wenxuan Wang | Di Wang
Findings of the Association for Computational Linguistics: ACL 2026

We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety–utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware design and training strategies for LLM-based monitors.
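
The two metrics named in this abstract can be sketched as follows. This is a minimal illustration that assumes the usual definitions (MR as the fraction of true misbehavior instances the monitor fails to flag, FAR as the fraction of benign instances it flags); the abstract does not give formulas, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def miss_and_false_alarm_rates(
    is_misbehavior: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Return (miss rate, false alarm rate) for a monitor's binary decisions."""
    if len(is_misbehavior) != len(flagged):
        raise ValueError("labels and monitor decisions must have the same length")
    pairs = list(zip(is_misbehavior, flagged))
    positives = sum(1 for label, _ in pairs if label)
    negatives = len(pairs) - positives
    # Assumed definition: a miss is a misbehavior instance the monitor did not flag.
    misses = sum(1 for label, flag in pairs if label and not flag)
    # Assumed definition: a false alarm is a benign instance the monitor flagged.
    false_alarms = sum(1 for label, flag in pairs if not label and flag)
    miss_rate = misses / positives if positives else 0.0
    false_alarm_rate = false_alarms / negatives if negatives else 0.0
    return miss_rate, false_alarm_rate
```

A monitor that flags everything reaches a miss rate of 0 at the price of a false alarm rate of 1, which is one way to read the MR/FAR trade-off the abstract reports.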

LongMP-Bench: A Benchmark for Multimodal Persona Understanding in Long-Term Dialogues
Zhuoqun Li | Zhaopei Huang | Wenxuan Wang | Qin Jin
Findings of the Association for Computational Linguistics: ACL 2026

Understanding multimodal user personas in long-term dialogues is essential for building personalized and human-like dialogue systems. However, existing datasets suffer from limited persona diversity and static, overly simplified settings, making them insufficient for capturing the complexity of real-world interactions. To address these limitations, we introduce LongMP-Bench, a benchmark designed to evaluate the capabilities of models in understanding evolving user personas within long-term multimodal dialogues. We present a multi-step, scalable data construction pipeline that generates long-term interaction records centered around multimodal personas, followed by human refinement for quality assurance. The resulting dataset contains long conversations from 150 users, each exhibiting visual consistency and dynamic persona development over time. Built on this dataset, we define a suite of tasks to comprehensively assess models’ ability to track persona evolution, integrate visual and textual inputs, and apply persona understanding in realistic dialogue scenarios. Extensive experiments on LongMP-Bench highlight the substantial challenges in multimodal persona understanding, especially in tracking persona shifts and leveraging multimodal context effectively. We will release our benchmark and code to facilitate future research in multimodal and personalized dialogue systems.

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
Youliang Yuan | Qiuyang Mang | Jingbang Chen | Hong Wan | Xiaoyuan Liu | Junjielong Xu | Jen-tse Huang | Wenxuan Wang | Wenxiang Jiao | Pinjia He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we observe that current models are susceptible to reward hacking, leading to a substantial overestimation of a model’s reasoning ability. This is evidenced by a high incidence of “false positives”: solutions that reach the correct answer through an unsound process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps, which are abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest that these Miracle Steps are linked to answer-recall shortcuts, including memorization from pretraining, where the model accesses the correct answer independently of its reasoning chain. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The RRM explicitly penalizes logical flaws and encourages rigorous deduction. When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building accurate and reliable models.

Exploring Attention Attractors in Large Language Models
Ziheng Wang | Zihao Yue | Wenxuan Wang | Qin Jin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper explores attention attractors, tokens that draw significantly high attention, in large language models. We analyze them from three perspectives: (1) Functionality: We demonstrate their role in aggregating information from preceding contexts to facilitate future predictions. (2) Distribution: Through layer-wise and token-wise analysis, we reveal that attention attractors are widely distributed across layers but predominantly originate from low-semantic words like "_the". (3) Mechanism: We demonstrate the correlation between the attention weights allocated to tokens and their specific activation dimension values. We hope these findings provide new insights into the attention mechanisms of large language models and inspire further exploration.

JARVIS or Ultron? A Survey on the Safety and Security Threats of Computer-Using Agents
Ada Chen | Yongjiang Wu | Junyuan Zhang | Jingyu Xiao | Shu Yang | Jen-tse Huang | Kun Wang | Wenxuan Wang | Shuai Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.

POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
Yichen Xu | Liangyu Chen | Liang Zhang | Zihao Yue | Jianzhe Ma | Wenxuan Wang | Qin Jin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Charts are a universally adopted medium for data communication, yet existing chart understanding benchmarks are overwhelmingly English-centric, limiting their accessibility and relevance to global audiences. To address this limitation, we introduce PolyChartQA, the first large-scale multilingual benchmark for chart question answering, comprising 22,606 charts and 26,151 QA pairs across 10 diverse languages. PolyChartQA is constructed through a scalable pipeline that enables efficient multilingual chart generation via data translation and code reuse, supported by LLM-based translation and rigorous quality control. We systematically evaluate multilingual chart understanding with PolyChartQA on state-of-the-art LVLMs and reveal a significant performance gap between English and other languages, particularly low-resource ones. Additionally, we introduce a companion multilingual chart question answering training set, PolyChartQA-Train, on which fine-tuning LVLMs yields substantial gains in multilingual chart understanding across diverse model sizes and architectures. Together, our benchmark provides a foundation for developing globally inclusive vision-language models capable of understanding charts across diverse linguistic contexts. Codes and datasets are available on https://github.com/Road2Redemption/PolyChartQA.

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
Wenxuan Wang | Zizhan Ma | Guo Yu | Yiu-Fai Cheung | Meidan Ding | Jie Liu | Wenting Chen | Linlin Shen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs benchmark development into five stages from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 56 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck is both a diagnostic tool for existing benchmarks and an actionable guideline for a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

MedEinst: Benchmarking the Einstellung Effect in Medical LLMs through Counterfactual Differential Diagnosis
Wenting Chen | Guolin Huang | Wenxuan Wang | Zhongrui Zhu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite achieving high accuracy on medical benchmarks, LLMs exhibit the Einstellung Effect in clinical diagnosis—relying on statistical shortcuts rather than patient-specific evidence, causing misdiagnosis in atypical cases. Existing benchmarks fail to detect this critical failure mode. We introduce MedEinst, a counterfactual benchmark with 5,383 paired clinical cases across 49 diseases. Each pair contains a control case and a "trap" case with altered discriminative evidence that flips the diagnosis. We measure susceptibility via Bias Trap Rate—probability of misdiagnosing traps despite correctly diagnosing controls. Evaluation shows frontier models achieve high baseline accuracy but severe bias trap rates. Thus, we propose ECR-Agent, aligning LLM reasoning with Evidence-Based Medicine via two components: (1) Dynamic Causal Inference (DCI) performs structured reasoning through dual-pathway perception, dynamic causal graph reasoning across three levels (association, intervention, counterfactual), and evidence audit for final diagnosis; (2) Critic-Driven Graph Memory Evolution (CGME) iteratively refines the system by storing validated reasoning paths in an exemplar base and consolidating disease-specific knowledge into evolving illness graphs. Source code is to be released.
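
The Bias Trap Rate described in this abstract reads as a conditional probability: among pairs whose control case is diagnosed correctly, the share whose trap case is misdiagnosed. A minimal sketch under that reading follows; the pair representation and the function name are assumptions for illustration, not the paper's implementation.

```python
from typing import Sequence, Tuple


def bias_trap_rate(pairs: Sequence[Tuple[bool, bool]]) -> float:
    """Estimate P(trap misdiagnosed | control diagnosed correctly).

    Each pair is (control_correct, trap_correct) for one control/trap case pair.
    """
    # Only pairs with a correct control diagnosis count toward the denominator.
    eligible = [trap_ok for control_ok, trap_ok in pairs if control_ok]
    if not eligible:
        return 0.0
    trapped = sum(1 for trap_ok in eligible if not trap_ok)
    return trapped / len(eligible)
```

Conditioning on a correct control diagnosis separates shortcut-driven errors from cases the model cannot diagnose at all, which is why a model can combine high baseline accuracy with a high trap rate.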

A Survey of Deep Learning for Geometry Problem Solving
Jianzhe Ma | Wenxuan Wang | Qin Jin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI’s mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2025

Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases
Jen-tse Huang | Yuhang Yan | Linqi Liu | Yixin Wan | Wenxuan Wang | Kai-Wei Chang | Michael R. Lyu
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent failures such as Google Gemini generating people of color in Nazi-era uniforms illustrate how AI outputs can be factually plausible yet socially harmful. AI models are increasingly evaluated for “fairness,” yet existing benchmarks often conflate two fundamentally different dimensions: factual correctness and normative fairness. A model may generate responses that are factually accurate but socially unfair, or conversely, appear fair while distorting factual reality. We argue that identifying the boundary between fact and fairness is essential for meaningful fairness evaluation. We introduce Fact-or-Fair, a benchmark with (i) objective queries aligned with descriptive, fact-based judgments, and (ii) subjective queries aligned with normative, fairness-based judgments. Our queries are constructed from 19 statistics and are grounded in cognitive psychology, drawing on representativeness bias, attribution bias, and ingroup–outgroup bias to explain why models often misalign fact and fairness. Experiments across ten frontier models reveal different levels of fact-fair trade-offs. By reframing fairness evaluation, we provide both a new theoretical lens and a practical benchmark to advance responsible model assessment. Our test suite is publicly available at https://github.com/uclanlp/Fact-or-Fair.

AI Sees Your Location—But With A Bias Toward The Wealthy World
Jingyuan Huang | Jen-tse Huang | Ziyi Liu | Xiaoyuan Liu | Wenxuan Wang | Jieyu Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, VLMs still show regional biases in this task. To systematically evaluate these issues, we introduce a benchmark consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, regional biases of frequently over-predicting certain locations remain. For instance, they consistently predict Sydney for images taken in Australia, shown by the low entropy scores for these countries. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.
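
The "low entropy scores" mentioned in this abstract can be illustrated with Shannon entropy over a model's predicted cities for one country. This is a sketch under assumptions: the abstract does not state the exact entropy definition or log base, so base 2 and the function name `prediction_entropy` are choices made here for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def prediction_entropy(predictions: Sequence[str]) -> float:
    """Shannon entropy (in bits, an assumed unit) of the predicted-label distribution."""
    n = len(predictions)
    if n == 0:
        return 0.0
    counts = Counter(predictions)
    # Entropy is 0 when every prediction is the same label and grows as
    # predictions spread across more labels.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())
```

Under this reading, a model that answers "Sydney" for every Australian image has entropy 0 for that country, whereas predictions spread evenly over two cities give 1 bit.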

VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models
Jen-tse Huang | Jiantong Qin | Jianping Zhang | Youliang Yuan | Wenxuan Wang | Jieyu Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., “What is the education level of the person in the image?”) (2) Yes-No comparisons using two images (e.g., “Is the person in the first image more educated than the person in the second image?”) For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at https://github.com/uscnlp-lime/VisBias.

ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations
Yuejin Xie | Youliang Yuan | Wenxuan Wang | Fan Mo | Jianmin Guo | Pinjia He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

LLMs are evolving into assistants that leverage tools, significantly expanding their capabilities but also introducing critical safety risks. Current models exhibit notable vulnerabilities, particularly in maintaining safety during multi-step tool interactions and in scenarios involving indirect harm. This paper introduces ToolSafety, a safety fine-tuning dataset designed to address these limitations. ToolSafety comprises 5,668 direct harm samples, 4,311 indirect harm samples, and 4,311 multi-step samples. Key features include support for multi-step safety through synthesized trajectories and realistic, context-aware sample generation. We fine-tuned LLaMA3.1-8B-Instruct and Qwen2.5-7B-Instruct using ToolSafety. Experimental results demonstrate that these models effectively maintain safety in multi-step and indirect harm scenarios. Further analysis into superficial alignment across different decoding strategies, languages, and jailbreak prompts indicates that while some risks persist, the issue is less severe than in multi-step settings. Overall, our approach significantly improves safety across various scenarios with small impact on helpfulness, positioning ToolSafety as a valuable resource for building safer tool-using AI systems.

VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models
Bingrui Sima | Linhua Cong | Wenxuan Wang | Kun He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The emergence of Multimodal Large Reasoning Models (MLRMs) has enabled sophisticated visual reasoning capabilities by integrating reinforcement learning and Chain-of-Thought (CoT) supervision. However, while these enhanced reasoning capabilities improve performance, they also introduce new and underexplored safety risks. In this work, we systematically investigate the security implications of advanced visual reasoning in MLRMs. Our analysis reveals a fundamental trade-off: as visual reasoning improves, models become more vulnerable to jailbreak attacks. Motivated by this critical finding, we introduce VisCRA (Visual Chain Reasoning Attack), a novel jailbreak framework that exploits the visual reasoning chains to bypass safety mechanisms. VisCRA combines targeted visual attention masking with a two-stage reasoning induction strategy to precisely control harmful outputs. Extensive experiments demonstrate VisCRA’s significant effectiveness, achieving high attack success rates on leading closed-source MLRMs: 76.48% on Gemini 2.0 Flash Thinking, 68.56% on QvQ-Max, and 56.60% on GPT-4o. Our findings highlight a critical insight: the very capability that empowers MLRMs — their visual reasoning — can also serve as an attack vector, posing significant security risks. Warning: This paper contains unsafe examples.

Learning to Ask: When LLM Agents Meet Unclear Instruction
Wenxuan Wang | Shi Juluan | Zixuan Ling | Yuk-Kit Chan | Chaozheng Wang | Cheryl Lee | Youliang Yuan | Jen-tse Huang | Wenxiang Jiao | Michael R. Lyu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Equipped with the capability to call functions, modern LLM agents can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLM agents but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the tool-use performance of LLM agents under imperfect instructions, we meticulously examine real-world user instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We find that due to the next-token prediction training objective, LLM agents tend to arbitrarily generate missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed, which prompts LLM agents to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLM agents’ performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that Ask-when-Needed significantly outperforms existing frameworks for tool learning on Noisy ToolBench. We will release all related code and datasets to support future research.

Co-authors

Kai-Wei Chang 1

Jingbang Chen 1

Yiu-Fai Cheung 1

Tianqing Fang 1

Jingyuan Huang 1

Zhaopei Huang 1

Michelle R Kaufman 1

Zhengliang Shi 1

Chaozheng Wang 1

Junjielong Xu 1

Dong Yu (于东) 1

Jianping Zhang 1

Junyuan Zhang 1

Venues