Varun Nathan


2026

We present a domain-grounded benchmark and evaluation framework for tool-aware plan generation in contact-center analytics, where answering a business-insights query requires decomposing it into executable steps over structured tools (Text2SQL over Snowflake), unstructured tools (RAG over transcripts), and LLM-based synthesis, with explicit depends_on relations for safe parallel execution. Our contributions are threefold: (i) a reference-based plan evaluation framework with two complementary views—a metric-wise evaluator spanning seven dimensions (e.g., tool–prompt alignment, query adherence) and a one-shot evaluator that compares a candidate plan against a reference plan; (ii) a lineage-driven data curation methodology that uses an iterative evaluator→optimizer loop to refine initial plans into high-quality plan lineages while reducing manual effort; and (iii) a large-scale study of 14 LLMs across model families and sizes on their ability to generate step-by-step, executable, tool-assigned plans, evaluated with and without lineage in the prompt. Empirically, LLMs continue to struggle on compound queries and on plans longer than four steps; the highest aggregate metric-wise score is 84.8 (Claude-3-7-Sonnet), while the strongest one-shot A+ rate (Extremely Good or Very Good) is only 49.75% (o3-mini). Lineage yields mixed overall gains but improves several strong models and often helps step executability. Overall, our results expose persistent weaknesses in tool understanding—especially tool–prompt alignment and tool-usage completeness—and show that shorter, simpler plans remain markedly easier. The benchmark, evaluation framework, and findings provide a practical path for assessing and improving agentic planning with tools in enterprise question-answering settings. An anonymized dataset with human-annotated reference plans, plan lineages, and per-planner outputs for all 14 planners is available at the anonymous repository linked in the paper.
Synthetic data is increasingly critical for contact centers, where privacy constraints and data scarcity limit the availability of real conversations. However, generating synthetic dialogues that are realistic and useful for downstream applications remains challenging. In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages. To test downstream utility, we evaluate synthetic transcripts on an automated quality assurance (AutoQA) task, finding that prompts optimized on real transcripts consistently outperform those optimized on synthetic transcripts. These results suggest that current synthetic transcripts fall short in capturing the full realism of real agent–customer interactions. To highlight these downstream gaps, we introduce a diagnostic evaluation framework comprising 17 metrics across four dimensions: (1) Emotional and Sentiment Arcs, (2) Linguistic Complexity, (3) Interaction Style, and (4) Conversational Properties. Our analysis shows that even with structured supervision, current generation strategies exhibit measurable deficiencies in sentiment fidelity, disfluency modeling, behavioral variation, and conversational realism. Together, these results highlight the importance of diagnostic, metric-driven evaluation for synthetic conversation generation intended for downstream applications.

2025

2024

This research investigates the impact of preference annotation acquisition methods on the performance of LLM alignment algorithms, including Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Conservative DPO (cDPO), compared to Supervised Fine-Tuning (SFT) in NLP tasks. We analyze the influence of LLM and human-based preferences on algorithm performance, considering data volume and quality. Additionally, we assess DPO’s vulnerability to overfitting and IPO’s resilience against it, addressing four main research questions. Using the GAIR dataset and Zephyr-7b as the SFT model, we reveal unexpected negative outcomes. Specifically, DPO trained on LLM preferences outperforms human preferences, contrary to expectations. Moreover, there’s no correlation between preference data volume or quality and algorithm performance. Contrary to expectations, DPO shows no overfitting in both human and LLM preference datasets. Surprisingly, cDPO doesn’t fare better than DPO under flip noise. Our findings highlight the complexities of preference annotation methods and underscore the importance of scrutinizing negative results in NLP algorithm research.
Fine-tuning large language models (LLMs) with domain-specific instruction dataset has emerged as an effective method to enhance their domain-specific understanding. Yet, there is limited work that examines the core characteristics acquired during this process. In this study, we benchmark the fundamental characteristics learned by contact-center (CC) domain specific instruction fine-tuned LLMs with out-of-the-box (OOB) LLMs via probing tasks encompassing conversational, channel, and automatic speech recognition (ASR) properties. We explore different LLM architectures (Flan-T5 and Llama) and sizes (3B, 7B, 11B, 13B). Our findings reveal remarkable effectiveness of CC-LLMs on the in-domain downstream tasks, with improvement in response acceptability by over 48% compared to OOB-LLMs. However, we observe that the performance of probing classifiers are relatively similar and does not reflect the performance of in-domain downstream tasks. A similar observation is also noted on SentEval dataset that assess capabilities of models in terms of surface, syntactic, and semantic information through probing tasks. Our study challenges the premise that probing classifiers can reveal the fundamental characteristics learned by large language models and is reflective of the downstream task performance, via a case-study of LLMs tuned for contact center domain.

2023

Contact centers handle both chat and voice calls for the same domain. As part of their workflow, it is a standard practice to summarize the conversations once they conclude. A significant distinction between chat and voice communication lies in the presence of disfluencies in voice calls, such as repetitions, restarts, and replacements. These disfluencies are generally considered noise for downstream natural language understanding (NLU) tasks. While a separate summarization model for voice calls can be trained in addition to chat specific model for the same domain, it requires manual annotations for both the channels and adds complexity arising due to maintaining two models. Therefore, it’s crucial to investigate if a model trained on fluent data can handle disfluent data effectively. While previous research explored impact of disfluency on question-answering and intent detection, its influence on summarization is inadequately studied. Our experiments reveal up to 6.99-point degradation in Rouge-L score, along with reduced fluency, consistency, and relevance when a fluent-trained model handles disfluent data. Replacement disfluencies have the highest negative impact. To mitigate this, we examine Fused-Fine Tuning by training the model with a combination of fluent and disfluent data, resulting in improved performance on both public and real-life datasets. Our work highlights the significance of incorporating disfluency in training summarization models and its advantages in an industrial setting.