Muyu He
2026
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
Muyu He | Anand Kumar | Soumyadeep Bakshi | James Zou | Nazneen Rajani
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Muyu He | Anand Kumar | Soumyadeep Bakshi | James Zou | Nazneen Rajani
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today’s benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend τ-Bench to τ-bench, where user behaviors are altered via controlled trait vectors. We observe an average 4%–20% performance degradation on τ-bench across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We plan to open-source τ-bench, across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios.
2025
TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games
Yuan Yuan | Muyu He | Muhammad Adil Shahid | Ziyang Li | Jiani Huang | Li Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Yuan Yuan | Muyu He | Muhammad Adil Shahid | Ziyang Li | Jiani Huang | Li Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidences within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, hinting at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, reasoning steps and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs’ deductive reasoning abilities in complex, narrative-rich environments.