Kawin Mayilvaghanan
2026
Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System
Kawin Mayilvaghanan | Siddhant Gupta | Ayush Kumar
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
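The two fairness metrics named in the abstract can be illustrated with a minimal sketch. It assumes each transcript yields one binary judgment and one numeric score for the original and for its counterfactual variant; the function names and input layout are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def counterfactual_flip_rate(original: Sequence[bool],
                             counterfactual: Sequence[bool]) -> float:
    """Fraction of pairs whose binary judgment reverses (CFR)."""
    if len(original) != len(counterfactual):
        raise ValueError("inputs must be paired one-to-one")
    if not original:
        raise ValueError("at least one pair is required")
    # A flip is any pair where the two judgments disagree.
    flips = sum(o != c for o, c in zip(original, counterfactual))
    return flips / len(original)


def mean_absolute_score_difference(original: Sequence[float],
                                   counterfactual: Sequence[float]) -> float:
    """Average absolute score shift across counterfactual pairs (MASD)."""
    if len(original) != len(counterfactual):
        raise ValueError("inputs must be paired one-to-one")
    if not original:
        raise ValueError("at least one pair is required")
    total = sum(abs(o - c) for o, c in zip(original, counterfactual))
    return total / len(original)


# Illustrative values only; these are not results from the paper.
judgments_original = [True, False, True, True]
judgments_counterfactual = [True, True, True, False]
cfr = counterfactual_flip_rate(judgments_original, judgments_counterfactual)

scores_original = [4.0, 3.0]
scores_counterfactual = [3.0, 3.5]
masd = mean_absolute_score_difference(scores_original, scores_counterfactual)
```

Under these assumptions, a CFR of 0 and a MASD of 0 would mean the evaluator's output is unchanged by the counterfactual edit; how the paper aggregates across dimensions and models is not specified in the abstract.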
2025
PROPEL: Prompt Optimization with Expert Priors for Small and Medium-sized LLMs
Kawin Mayilvaghanan | Varun Nathan | Ayush Kumar
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Call Summarization
Kawin Mayilvaghanan | Siddhant Gupta | Ayush Kumar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Abstractive summarization is a core application in contact centers, where Large Language Models (LLMs) generate millions of summaries of call transcripts daily. Despite their apparent quality, it remains unclear whether LLMs systematically under- or over-attend to specific aspects of the transcript, potentially introducing biases in the generated summary. While prior work has examined social and positional biases, the specific forms of bias pertinent to contact center operations, which we term ‘Operational Bias’, have remained unexplored. To address this gap, we introduce BlindSpot, a framework built upon a taxonomy of 15 operational bias dimensions (e.g., disfluency, speaker, topic) for the identification and quantification of these biases. BlindSpot leverages an LLM as a zero-shot classifier to derive, for each bias dimension, a categorical distribution over both the transcript and its summary. The bias is then quantified using two metrics: Fidelity Gap, measured as the Total Variation Distance (TVD) between the two distributions, and Coverage, defined as the percentage of source labels omitted. Using BlindSpot, we conduct an empirical study with 2,500 real call transcripts and their summaries generated by 20 LLMs of varying scales and families (e.g., GPT, Llama, Claude). Our analysis reveals that biases are systemic and present across all evaluated models, regardless of size or family. We further report on bias mitigation via targeted prompting, which measurably reduces bias across models.
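As a rough illustration of the two metrics, the sketch below computes a Total Variation Distance between label distributions and the share of source labels absent from the summary. It assumes the zero-shot classifier has already produced one list of categorical labels for the transcript and one for the summary, and it treats "source labels" as distinct label values; both are assumptions, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def label_distribution(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Normalize label counts into a categorical distribution."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def fidelity_gap(source_labels: Sequence[Hashable],
                 summary_labels: Sequence[Hashable]) -> float:
    """Total Variation Distance: half the L1 distance between distributions."""
    p = label_distribution(source_labels)
    q = label_distribution(summary_labels)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


def omitted_label_pct(source_labels: Sequence[Hashable],
                      summary_labels: Sequence[Hashable]) -> float:
    """Percentage of distinct source labels that never appear in the summary."""
    source_set = set(source_labels)
    if not source_set:
        raise ValueError("at least one source label is required")
    omitted = source_set - set(summary_labels)
    return 100.0 * len(omitted) / len(source_set)


# Illustrative labels only; these are not data from the paper.
transcript_speakers = ["agent", "agent", "customer", "customer"]
summary_speakers = ["agent", "agent", "agent", "customer"]
gap = fidelity_gap(transcript_speakers, summary_speakers)

transcript_topics = ["billing", "refund", "outage", "greeting"]
summary_topics = ["billing", "refund", "outage"]
omitted = omitted_label_pct(transcript_topics, summary_topics)
```

TVD ranges from 0 (identical distributions) to 1 (disjoint support), so a larger Fidelity Gap indicates that the summary redistributes attention away from the transcript's own proportions.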