Shi Juluan
2026
Identifying the Achilles’ Heel: An Iterative Method for Uncovering Factual Errors in Large Language Models
Wenxuan Wang | Yuk-Kit Chan | Zixuan Ling | Shi Juluan | Youliang Yuan | Jen-tse Huang | Yifei Zhang | Wenxiang Jiao | Zhaopeng Tu | Michael R. Lyu
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors that can mislead users, raising concerns in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs’ veracity are limited by the need for extensive human labor, test data contamination, or limited scope, hindering efficient and effective exposure of errors. To address these challenges, we propose HalluHunter, a novel, fully automated framework for systematically uncovering factual inaccuracies in LLMs. HalluHunter employs a knowledge-graph-based approach, extracting fact triplets to generate diverse question types for single- and multi-hop reasoning using rule-based Natural Language Processing (NLP) techniques. Its iterative process starts with random triplet selection for question generation, followed by adaptive selection in subsequent iterations, which targets triplets on which LLMs frequently err, as identified by analyzing their performance. Our extensive tests on nine prominent LLMs reveal that HalluHunter can trigger factual errors in up to 55% of questions in these models. Moreover, we demonstrate that HalluHunter’s test cases, particularly those from adaptive selection, can further expose weaknesses when benchmarking the factuality of LLMs while maintaining the coverage of questions. All code, data, and results will be released for future research.
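The iterative loop described in this abstract (random triplet selection first, then adaptive selection driven by observed errors) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the HalluHunter implementation: the question template, the per-relation error weighting, the exact-match answer check, and every name (`make_question`, `select_triplets`, `run_iterations`, `ask_model`) are hypothetical.

```python
import random
from collections import defaultdict

Triplet = tuple[str, str, str]  # (subject, relation, object)


def make_question(triplet: Triplet) -> tuple[str, str]:
    """Turn a fact triplet into a single-hop question (hypothetical template)."""
    subject, relation, obj = triplet
    return f"What is the {relation} of {subject}?", obj


def select_triplets(triplets, k, error_counts, asked_counts, rng):
    """Random selection in the first round, error-weighted selection afterwards."""
    if not asked_counts:
        return rng.sample(triplets, k)

    def weight(t):
        rel = t[1]
        # Assumption: errors are tracked per relation; smoothing keeps
        # relations that were never asked about selectable.
        return (error_counts[rel] + 1) / (asked_counts[rel] + 2)

    # Samples with replacement, so a triplet may repeat within a round.
    return rng.choices(triplets, weights=[weight(t) for t in triplets], k=k)


def run_iterations(triplets, ask_model, rounds, k, seed=0):
    """Query the model for several rounds and collect the questions it fails."""
    rng = random.Random(seed)
    error_counts = defaultdict(int)
    asked_counts = defaultdict(int)
    failures = []
    for _ in range(rounds):
        for t in select_triplets(triplets, k, error_counts, asked_counts, rng):
            question, expected = make_question(t)
            asked_counts[t[1]] += 1
            # Assumption: correctness is a case-insensitive exact match.
            if ask_model(question).strip().lower() != expected.lower():
                error_counts[t[1]] += 1
                failures.append((question, expected))
    return failures, dict(error_counts), dict(asked_counts)
```

Here `ask_model` stands in for any callable that maps a question string to the model's answer string. Weighting by relation is one simple way to steer later rounds toward weak spots; the abstract does not say at what granularity the adaptive selection operates.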
2025
Learning to Ask: When LLM Agents Meet Unclear Instruction
Wenxuan Wang | Shi Juluan | Zixuan Ling | Yuk-Kit Chan | Chaozheng Wang | Cheryl Lee | Youliang Yuan | Jen-tse Huang | Wenxiang Jiao | Michael R. Lyu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Equipped with the capability to call functions, modern LLM agents can leverage external tools to address a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLM agents but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the tool-use performance of LLM agents under imperfect instructions, we meticulously examine real-world instructions from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We find that, due to the next-token prediction training objective, LLM agents tend to arbitrarily generate missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed, which prompts LLM agents to ask users questions whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and to assess LLM agents’ performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that Ask-when-Needed significantly outperforms existing tool-learning frameworks on Noisy ToolBench. We will release all related code and datasets to support future research.
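The core idea of asking the user instead of inventing a missing argument can be illustrated with a small sketch. It is not the Ask-when-Needed framework itself: the `Tool` type, the `call_with_clarification` helper, the `ask_user` callback, and the question wording are all hypothetical, and the returned question count is only a loose stand-in for the efficiency perspective mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """A callable tool with a list of required argument names (hypothetical)."""
    name: str
    required: tuple[str, ...]
    run: Callable[..., str]


def call_with_clarification(tool, args, ask_user, max_questions=3):
    """Call the tool, asking the user for each required argument that is missing.

    Returns the tool result and the number of clarifying questions asked.
    """
    args = dict(args)  # copy so the caller's dict is not modified
    questions = 0
    for name in tool.required:
        if args.get(name) in (None, ""):
            if questions >= max_questions:
                raise ValueError(f"too many missing arguments for {tool.name}")
            # Ask rather than guess a value for the missing argument.
            args[name] = ask_user(f"Please provide a value for '{name}'.")
            questions += 1
    return tool.run(**args), questions


def book_flight(origin: str, destination: str, date: str) -> str:
    """Toy tool used only for illustration."""
    return f"Booked {origin} -> {destination} on {date}"


flight_tool = Tool("book_flight", ("origin", "destination", "date"), book_flight)
```

For instance, calling `call_with_clarification(flight_tool, {"origin": "HKG", "destination": "NRT"}, ask_user=input)` would prompt once for the date before running the tool.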