Aashraya Sachdeva


2026

Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent’s capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.

2024

Recent advancements in large Language Models (LMs) have significantly enhanced their capabilities across various domains, including natural language understanding and generation. In this paper, we investigate the application of LMs to the specialized task of contact-center Quality Assurance (QA), which involves evaluating conversations between human agents and customers. This task requires both sophisticated linguistic understanding and deep domain knowledge. We conduct a comprehensive assessment of eight LMs, revealing that larger models, such as Claude-3.5-Sonnet, exhibit superior performance in comprehending contact-center conversations. We introduce methodologies to transfer this domain-specific knowledge to smaller models by leveraging evaluation plans generated by more knowledgeable models, with optional human-in-the-loop refinement to enhance the capabilities of smaller models. Notably, our experimental results demonstrate an improvement of up to 18.95% in Macro F1 on an in-house QA dataset. Our findings emphasize the importance of evaluation plans in guiding reasoning and highlight the potential of AI-assisted tools to advance objective, consistent, and scalable agent evaluation processes in contact centers.