Jithendra Vepa


2024

Improving Hierarchical Text Clustering with LLM-guided Multi-view Cluster Representation
Anup Pattnaik | Cijo George | Rishabh Kumar Tripathi | Sasanka Vutla | Jithendra Vepa
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

In this work, we present an approach that introduces different perspectives or views to improve the quality of hierarchical clustering of interaction drivers in a contact center. Specifically, we present a multi-stage approach that introduces LLM-guided multi-view cluster representation, which significantly improves the quality of the generated clusters. Our approach improves the average Silhouette Score by up to 70% and Human Preference Scores by 36.7% for top-level clusters compared to standard agglomerative clustering for the given business use-case. We also show how the proposed approach can be adapted to standard non-hierarchical clustering use-cases, where it achieves state-of-the-art performance on public datasets based on NMI and ACC scores, with a minimal number of LLM queries compared to the current state-of-the-art approaches. Moreover, we apply our technique to generate two new labeled datasets for hierarchical clustering. We open-source these labeled datasets, validated and corrected by domain experts, for the benefit of the research community.
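As a rough illustration of the core idea, the sketch below builds several textual "views" of each interaction driver, averages their sentence embeddings into a single multi-view representation, and feeds the result to agglomerative clustering, scored with the Silhouette metric. It assumes sentence-transformers and scikit-learn (≥ 1.2 for the `metric` argument); the hand-written view templates stand in for the LLM-generated perspectives described in the paper and are purely illustrative.

```python
# Sketch: multi-view representations fed into agglomerative clustering.
# The "views" below are toy stand-ins for LLM-generated rewrites of each driver.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

drivers = [
    "cancel my subscription",
    "refund for a double charge",
    "app keeps crashing on login",
    "how do I cancel my plan",
]
views = [
    drivers,                                       # view 1: raw text
    [f"customer intent: {d}" for d in drivers],    # view 2: intent framing (illustrative)
    [f"topic of the call: {d}" for d in drivers],  # view 3: topic framing (illustrative)
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
# Encode every view and average the embeddings into one multi-view representation.
embeddings = np.mean([encoder.encode(v) for v in views], axis=0)

clustering = AgglomerativeClustering(n_clusters=2, linkage="average", metric="cosine")
labels = clustering.fit_predict(embeddings)
print(labels, silhouette_score(embeddings, labels, metric="cosine"))
```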

Probing the Depths of Language Models’ Contact-Center Knowledge for Quality Assurance
Digvijay Anil Ingle | Aashraya Sachdeva | Surya Prakash Sahu | Mayank Sati | Cijo George | Jithendra Vepa
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Recent advancements in large language models (LMs) have significantly enhanced their capabilities across various domains, including natural language understanding and generation. In this paper, we investigate the application of LMs to the specialized task of contact-center Quality Assurance (QA), which involves evaluating conversations between human agents and customers. This task requires both sophisticated linguistic understanding and deep domain knowledge. We conduct a comprehensive assessment of eight LMs, revealing that larger models, such as Claude-3.5-Sonnet, exhibit superior performance in comprehending contact-center conversations. We introduce methodologies to transfer this domain-specific knowledge to smaller models by leveraging evaluation plans generated by more knowledgeable models, with optional human-in-the-loop refinement. Notably, our experimental results demonstrate an improvement of up to 18.95% in Macro F1 on an in-house QA dataset. Our findings emphasize the importance of evaluation plans in guiding reasoning and highlight the potential of AI-assisted tools to advance objective, consistent, and scalable agent evaluation processes in contact centers.
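A hedged sketch of the plan-then-evaluate pattern described above: a stronger model is asked once for an evaluation plan for a QA question, an optional human pass refines it, and the plan is prepended to the smaller model's prompt. The `call_llm` helper is hypothetical (a stand-in for whatever LLM client is used) and the prompts are illustrative, not the paper's.

```python
# Sketch of transferring QA knowledge via an evaluation plan.
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical helper; swap in a real SDK call (OpenAI, Anthropic, etc.)."""
    return f"[{model} response to: {prompt[:40]}...]"

QUESTION = "Did the agent confirm the customer's identity before discussing the account?"

# Step 1: a larger, more knowledgeable model drafts a reusable evaluation plan.
plan = call_llm(
    "large-model",
    f"Write a step-by-step evaluation plan a QA analyst would follow to answer: {QUESTION}",
)
# Optional: a human reviewer refines `plan` before it is reused across conversations.

# Step 2: a smaller model evaluates each conversation, guided by the plan.
def evaluate(transcript: str) -> str:
    prompt = (
        f"Evaluation plan:\n{plan}\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Question: {QUESTION}\nAnswer yes or no with a brief justification."
    )
    return call_llm("small-model", prompt)

print(evaluate("Agent: Can I have your name and date of birth? Customer: ..."))
```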

2023

Investigating the Role and Impact of Disfluency on Summarization
Varun Nathan | Ayush Kumar | Jithendra Vepa
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Contact centers handle both chat and voice calls for the same domain. As part of their workflow, it is standard practice to summarize conversations once they conclude. A significant distinction between chat and voice communication lies in the presence of disfluencies in voice calls, such as repetitions, restarts, and replacements. These disfluencies are generally considered noise for downstream natural language understanding (NLU) tasks. While a separate summarization model for voice calls can be trained in addition to a chat-specific model for the same domain, this requires manual annotations for both channels and adds the complexity of maintaining two models. Therefore, it is crucial to investigate whether a model trained on fluent data can handle disfluent data effectively. While previous research has explored the impact of disfluency on question answering and intent detection, its influence on summarization is inadequately studied. Our experiments reveal up to a 6.99-point degradation in Rouge-L score, along with reduced fluency, consistency, and relevance, when a fluent-trained model handles disfluent data. Replacement disfluencies have the highest negative impact. To mitigate this, we examine Fused Fine-Tuning, training the model with a combination of fluent and disfluent data, which improves performance on both public and real-life datasets. Our work highlights the significance of incorporating disfluency when training summarization models and its advantages in an industrial setting.
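The minimal sketch below shows the data-mixing side of the fused fine-tuning idea: fluent transcripts are paired with disfluent variants of the same conversations and combined into one training set. The injected repetitions are a crude stand-in for real voice-call disfluencies, and the summarizer itself is only indicated.

```python
# Sketch: fused fine-tuning data = fluent transcripts + disfluent variants.
import random

def inject_disfluencies(text: str, p: float = 0.2, seed: int = 0) -> str:
    """Randomly repeat words (a toy 'repetition' disfluency) with probability p."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        out.append(token)
        if rng.random() < p:
            out.append(token)  # e.g. "I I want to cancel"
    return " ".join(out)

fluent_data = [
    {"dialogue": "Customer: I want to cancel my order. Agent: Sure, it is cancelled.",
     "summary": "Customer asked to cancel an order; the agent cancelled it."},
]
disfluent_data = [{**ex, "dialogue": inject_disfluencies(ex["dialogue"])} for ex in fluent_data]
fused_data = fluent_data + disfluent_data  # fine-tune the summarizer on this mixture
print(fused_data[1]["dialogue"])
```

A full pipeline would then fine-tune a seq2seq summarizer on `fused_data` and compare Rouge-L, fluency, consistency, and relevance on fluent versus disfluent test splits.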

2022

Investigating the Characteristics of a Transformer in a Few-Shot Setup: Does Freezing Layers in RoBERTa Help?
Digvijay Ingle | Rishabh Tripathi | Ayush Kumar | Kevin Patel | Jithendra Vepa
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Transformer based language models have been widely adopted by industrial and research organisations for developing machine learning applications in the presence of limited annotated data. While these models show remarkable results, their functioning in few-shot settings is still poorly understood. Hence, we perform an investigative study to understand the characteristics of such models fine-tuned in few-shot setups. Specifically, we compare the intermediate layer representations obtained from a few-shot model and a pre-trained language model. We observe that pre-trained and few-shot models show similar representations over the initial layers, whereas the later layers show a stark deviation. Based on these observations, we propose to freeze the initial Transformer layers when fine-tuning the model in a constrained text classification setup with K annotated data points per class, where K ranges from 8 to 64. In our experiments across six benchmark sentence classification tasks, we discover that freezing the initial 50% of Transformer layers not only reduces training time but also surprisingly improves Macro F1 (by up to 8%) compared to fully trainable layers in the few-shot setup. We also observe that this idea of layer freezing generalizes well to state-of-the-art few-shot text classification techniques, such as DNNC and LM-BFF, leading to a significant reduction in training time while maintaining comparable performance.
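A minimal sketch of the layer-freezing setup, assuming Hugging Face transformers and `roberta-base` (12 encoder layers): the embeddings and the first half of the encoder layers are frozen before few-shot fine-tuning; the number of labels is illustrative.

```python
# Sketch: freeze the embeddings and the first 50% of RoBERTa's encoder layers
# before few-shot fine-tuning, leaving only the later layers trainable.
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=5)

# Freeze token/position embeddings.
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False

# Freeze the initial half of the 12 encoder layers (layers 0-5).
n_frozen = len(model.roberta.encoder.layer) // 2
for layer in model.roberta.encoder.layer[:n_frozen]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters after freezing: {trainable:,}")
```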

Partially Humanizing Weak Supervision: Towards a Better Low Resource Pipeline for Spoken Language Understanding
Ayush Kumar | Rishabh Tripathi | Jithendra Vepa
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)

Weak Supervised Learning (WSL) is a popular technique to develop machine learning models in the absence of labeled training data. WSL involves training over noisy labels, which are traditionally obtained from hand-engineered semantic rules and task-specific pre-trained models. Such rules offer limited coverage and generalization across tasks; on the other hand, pre-trained models are available only for limited tasks. Thus, obtaining weak labels is a bottleneck in weak supervised learning. In this work, we propose to utilize the prompting paradigm to generate weak labels for the underlying tasks. We show that task-agnostic prompts are generalizable and can be used to obtain noisy labels for different Spoken Language Understanding (SLU) tasks such as sentiment classification, disfluency detection, and emotion classification. These prompts can additionally be updated with a human-in-the-loop to add task-specific contexts, thus providing flexibility to design task-specific prompts. Our proposed WSL pipeline outperforms other competitive low-resource benchmarks on zero- and few-shot learning by more than 4% on Macro-F1 and a conventional rule-based WSL baseline by more than 5% across all the benchmark datasets. We demonstrate that prompt-based methods save nearly 75% of the time in a weakly supervised framework and generate more reliable labels for the above SLU tasks, and thus can be used as a universal strategy to obtain weak labels.
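A hedged sketch of prompt-based weak labeling for one SLU task (sentiment classification): a cloze-style prompt is filled by a masked language model and the higher-scoring verbalizer becomes the noisy label for downstream training. The template and the "great"/"terrible" verbalizers are illustrative choices, not the paper's; a human-in-the-loop could refine them with task-specific context as described above.

```python
# Sketch: cloze-style prompting with a masked LM to obtain weak sentiment labels.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

VERBALIZERS = {"great": "positive", "terrible": "negative"}  # illustrative mapping

def weak_label(utterance: str) -> str:
    prompt = f"{utterance} Overall, the experience was <mask>."
    scores = dict.fromkeys(VERBALIZERS, 0.0)
    # Restrict the mask predictions to the verbalizer tokens and keep their scores.
    for cand in fill_mask(prompt, targets=list(VERBALIZERS)):
        word = cand["token_str"].strip()
        if word in scores:
            scores[word] = cand["score"]
    return VERBALIZERS[max(scores, key=scores.get)]

print(weak_label("Thanks, the agent resolved my issue quickly."))    # noisy label
print(weak_label("I have been on hold for an hour, this is awful.")) # noisy label
```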

2021

What BERT Based Language Model Learns in Spoken Transcripts: An Empirical Study
Ayush Kumar | Mukuntha Narayanan Sundararaman | Jithendra Vepa
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Language Models (LMs) have been ubiquitously leveraged in various tasks, including spoken language understanding (SLU). Spoken language requires careful understanding of speaker interactions, dialog states, and speech-induced multimodal behaviors to generate a meaningful representation of the conversation. In this work, we propose to dissect SLU into three representative properties: conversational (disfluency, pause, overtalk), channel (speaker-type, turn-tasks), and ASR (insertion, deletion, substitution). We probe BERT based language models (BERT, RoBERTa) trained on spoken transcripts to investigate their ability to understand multifarious properties in the absence of any speech cues. Empirical results indicate that the LM is surprisingly good at capturing conversational properties such as pause prediction and overtalk detection from lexical tokens. On the downside, the LM scores low on turn-tasks and ASR error prediction. Additionally, pre-training the LM on spoken transcripts restrains its linguistic understanding. Finally, we establish the efficacy and transferability of the mentioned properties on two benchmark datasets: the Switchboard Dialog Act and Disfluency datasets.
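A minimal sketch of the probing setup: frozen BERT representations from a chosen intermediate layer are fed to a lightweight classifier (logistic regression here) to test whether a property is linearly decodable from lexical tokens alone. The toy utterances, binary labels, and probed layer are illustrative, not taken from the paper's data.

```python
# Sketch: probe a frozen BERT layer with a linear classifier for a conversational
# property (toy binary "pause follows this utterance" labels, purely illustrative).
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

utterances = ["um so i was thinking", "yeah that works for me",
              "let me check hold on", "okay great thanks"]
labels = [1, 0, 1, 0]  # toy property labels

LAYER = 8  # intermediate layer to probe
with torch.no_grad():
    enc = tokenizer(utterances, padding=True, return_tensors="pt")
    hidden = model(**enc).hidden_states[LAYER]  # (batch, seq_len, hidden_dim)
    features = hidden.mean(dim=1).numpy()       # mean-pool over tokens

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe train accuracy:", probe.score(features, labels))
```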