Mayank Sati


2025

Contact centers process millions of customer conversations daily, requiring Quality Assurance (QA) teams to evaluate agent performance against compliance and service standards, often by answering agent evaluation questionnaires. Traditional manual QA cannot scale to growing volumes, while fully automated evaluation using large language models presents a cost-performance trade-off. High-performing models excel at detecting rare but business-critical Answers of Interest (AoI) but incur prohibitive costs, while smaller fine-tuned models are economical but suffer from poor AoI precision, generating high false positive rates that erode agent trust and waste QA resources. We introduce STREAQ, a two-tier selective routing framework to intelligently route queries between cost-efficient and high-capability models. Based on benchmarking on a proprietary dataset across six large LMs, STREAQ achieves substantial cost reduction while preserving critical performance. Using Nova-Pro, STREAQ reduces daily costs by 48% from 34,162 to17,842 while retaining 88.9% of full-model AoI precision. Our ablation studies reveal that flawed reasoning from smaller models can degrade performance, emphasizing the importance of carefully designing routing systems, making enterprise-scale automated QA both practical and economically viable.

2024

Recent advancements in large Language Models (LMs) have significantly enhanced their capabilities across various domains, including natural language understanding and generation. In this paper, we investigate the application of LMs to the specialized task of contact-center Quality Assurance (QA), which involves evaluating conversations between human agents and customers. This task requires both sophisticated linguistic understanding and deep domain knowledge. We conduct a comprehensive assessment of eight LMs, revealing that larger models, such as Claude-3.5-Sonnet, exhibit superior performance in comprehending contact-center conversations. We introduce methodologies to transfer this domain-specific knowledge to smaller models by leveraging evaluation plans generated by more knowledgeable models, with optional human-in-the-loop refinement to enhance the capabilities of smaller models. Notably, our experimental results demonstrate an improvement of up to 18.95% in Macro F1 on an in-house QA dataset. Our findings emphasize the importance of evaluation plans in guiding reasoning and highlight the potential of AI-assisted tools to advance objective, consistent, and scalable agent evaluation processes in contact centers.