Zhaoyi Joey Hou
2026
When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue
Tanya Shourya | Yingfan Wang | Zhaoyi Joey Hou | Shamik Roy | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues—such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.
2025
Leveraging Large Models to Evaluate Novel Content: A Case Study on Advertisement Creativity
Zhaoyi Joey Hou | Adriana Kovashka | Xiang Lorraine Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.