Zhaoyi Joey Hou

2026

When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue
Tanya Shourya | Yingfan Wang | Zhaoyi Joey Hou | Shamik Roy | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)

Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues—such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.

2025

Leveraging Large Models to Evaluate Novel Content: A Case Study on Advertisement Creativity
Zhaoyi Joey Hou | Adriana Kovashka | Xiang Lorraine Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.

Co-authors

Tanya Shourya 1

Yingfan Wang 1
