Ziyi Wang
2026
Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents
Ziyi Wang | Yuxuan Lu | Yimeng Zhang | Pei Chen | Ziwei Dong | Jing Huang | Jiri Gesi | Xianfeng Tang | Chen Luo | Qun Liu | Yisi Sang | Hanqing Lu | Manling Li | Jin Lai | Dakuo Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool-calling agents are increasingly deployed in real-world customer-facing workflows. Yet most studies on tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and these diverse, complex interaction patterns remain under-represented in training and evaluation data. To bridge the gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intent. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories. It then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark several state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger tool-calling ability.
OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Ziyi Wang | Yuxuan Lu | Wenbo Li | Amirali Amini | Bo Sun | Yakov Bart | Weimin Lyu | Jiri Gesi | Tian Wang | Jing Huang | Yu Su | Upol Ehsan | Malihe Alikhani | Toby Jia-Jun Li | Lydia Chilton | Dakuo Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating believable human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. **OPeRA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales**. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish **the first benchmark to evaluate how well current LLMs can predict a specific user’s next action** and rationale, given a persona and a history of <observation, action, rationale> tuples. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.