2025
Multi-Level Explanations for Generative Language Models
Lucas Monteiro Paes | Dennis Wei | Hyo Jin Do | Hendrik Strobelt | Ronny Luss | Amit Dhurandhar | Manish Nagireddy | Karthikeyan Natesan Ramamurthy | Prasanna Sattigeri | Werner Geyer | Soumya Ghosh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model’s output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.
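To make the core idea concrete, here is a minimal sketch of perturbation-based attribution for context-grounded generation, using leave-one-out perturbation as a simplification of the LIME- and SHAP-style methods MExGen extends. The `generate` and `similarity` callables are hypothetical stand-ins for an LLM and a text-similarity scorer; they are assumptions, not part of the released ICX360 API.

```python
# A leave-one-out sketch of perturbation-based attribution for
# context-grounded generation. Not the paper's exact algorithm.
from typing import Callable, List

def attribute_spans(
    context_spans: List[str],                 # context split into units, e.g. sentences
    question: str,
    generate: Callable[[str, str], str],      # (context, question) -> response
    similarity: Callable[[str, str], float],  # higher = more similar texts
) -> List[float]:
    """Score each span by how much removing it changes the generated output."""
    full_context = " ".join(context_spans)
    reference = generate(full_context, question)
    scores = []
    for i in range(len(context_spans)):
        # Perturb: drop span i and regenerate.
        perturbed = " ".join(s for j, s in enumerate(context_spans) if j != i)
        response = generate(perturbed, question)
        # Influence of span i = similarity drop relative to the reference output.
        scores.append(1.0 - similarity(reference, response))
    return scores
```

Scoring coarse units first and refining only the influential ones (plausibly the "multi-level" aspect of the name) is one way to keep the number of model calls manageable for long inputs; see the paper for the actual procedure.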
Conceptual Diagnostics for Knowledge Graphs and Large Language Models
Rosario Uceda Sosa | Maria Chang | Karthikeyan Natesan Ramamurthy | Moninder Singh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Industrial applications pose heightened requirements for consistency and reliability of large language models (LLMs). While LLMs are being tested with increasingly complex reasoning tasks, we argue that much can be learned via diagnostic tools that probe a fundamentally basic type of reasoning: conceptual consistency, e.g., a rule applying to “all surgeons” must also apply to “cardiac surgeons” since a cardiac surgeon is a type of surgeon. In this emerging industry track submission, we propose a method that takes concept hierarchies from a knowledge graph (KG) and automatically generates benchmarks that test conceptual consistency in LLMs. We develop a multi-domain benchmark that reveals rates of conceptual inconsistencies in several state-of-the-art LLMs. Additionally, we use measured levels of inconsistency and disagreement in LLMs to find potentially problematic subgraphs in the reference KG. As such, our method offers a scalable complement to symbolic curation, maintenance, and refinement of knowledge graphs, which is a critical activity in KG-based industrial applications.
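As an illustration, here is a minimal sketch of how subclass/superclass pairs from a KG hierarchy could be turned into consistency probes, following the paper's "all surgeons" example. The hierarchy, question templates, and `ask_model` callable are illustrative assumptions, not the benchmark's actual format.

```python
# Turn a toy concept hierarchy into paired probes that a conceptually
# consistent model should answer identically. Illustrative only.
HIERARCHY = {
    "cardiac surgeon": "surgeon",
    "surgeon": "physician",
}

def consistency_probes(rule: str):
    """Yield (general, specific) question pairs that should agree."""
    for sub, sup in HIERARCHY.items():
        general = f"Does the rule '{rule}' apply to all {sup}s? Answer yes or no."
        specific = f"Does the rule '{rule}' apply to a {sub}? Answer yes or no."
        yield general, specific

def inconsistency_rate(ask_model, rule: str) -> float:
    """Fraction of pairs where the model affirms the general case but not the specific one."""
    pairs = list(consistency_probes(rule))
    bad = sum(ask_model(g) == "yes" and ask_model(s) != "yes" for g, s in pairs)
    return bad / len(pairs)
```

Aggregating such rates per subgraph is one plausible way disagreement across models could flag problematic regions of the reference KG, as the abstract describes.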
Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
Ivoline C. Ngong | Swanand Ravindra Kadhe | Hao Wang | Keerthiram Murugesan | Justin D. Weisz | Amit Dhurandhar | Karthikeyan Natesan Ramamurthy
Findings of the Association for Computational Linguistics: ACL 2025
Conversational agents are increasingly woven into individuals’ personal lives, yet users often underestimate the privacy risks associated with them. The moment users share information with these agents, such as large language models (LLMs), their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLM-based Conversational Agents (LCAs). It aims to minimize privacy risks by ensuring that users (senders) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LCAs (untrusted receivers). Through a formative design user study, we observe how even “privacy-conscious” users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally deployable framework that operates between users and LCAs, identifying and reformulating out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user’s intended interaction goals. Notably, about 76% of participants in our human evaluation preferred the reformulated prompts over the original ones, validating the usability and effectiveness of contextual privacy in our proposed framework. We open source the code at https://github.com/IBM/contextual-privacy-LLM.
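A minimal sketch of the reformulation step such a framework performs is shown below. The `local_model` callable and the instruction text are illustrative assumptions standing in for the lightweight, locally deployed model and the paper's actual prompting.

```python
# Sketch of a local guard between the user and an untrusted conversational
# agent: out-of-context details are removed or generalized before sending.
REFORMULATE_INSTRUCTION = (
    "Given the user's goal and prompt, remove or generalize any personal "
    "details that are not necessary to achieve the goal, and return only "
    "the rewritten prompt.\nGoal: {goal}\nPrompt: {prompt}"
)

def guard_prompt(local_model, goal: str, prompt: str) -> str:
    """Reformulate `prompt` locally before it reaches the untrusted agent."""
    return local_model(REFORMULATE_INSTRUCTION.format(goal=goal, prompt=prompt))
```

Because the guard runs locally, the original prompt never leaves the device; only the reformulated version is forwarded to the agent.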
Evaluating the Prompt Steerability of Large Language Models
Erik Miehling | Michael Desmond | Karthikeyan Natesan Ramamurthy | Elizabeth M. Daly | Kush R. Varshney | Eitan Farchi | Pierre Dognin | Jesus Rios | Djallel Bouneffouf | Miao Liu | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Building pluralistic AI requires designing models that can be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model’s joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited, owing to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
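One simple way to quantify such a shift is shown in the sketch below, which measures the distance between a model's baseline and steered behavioral distributions. The total-variation distance and the normalization here are illustrative assumptions, not the benchmark's exact definition of its steerability indices.

```python
# Sketch: distance between baseline and steered behavior over discrete
# behavioral labels (e.g., persona-consistent vs. not). Illustrative only.
from collections import Counter
from typing import Dict, List

def behavior_distribution(responses: List[str]) -> Dict[str, float]:
    """Empirical distribution over behavioral labels."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def steerability_index(baseline: List[str], steered: List[str]) -> float:
    """Total-variation distance between baseline and steered behavior, in [0, 1]."""
    p = behavior_distribution(baseline)
    q = behavior_distribution(steered)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)
```

Plotting such an index against steering effort (for instance, the number of persona statements in the prompt) traces a steerability curve per persona dimension and direction.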
2024
Value Alignment from Unstructured Text
Inkit Padhi | Karthikeyan Natesan Ramamurthy | Prasanna Sattigeri | Manish Nagireddy | Pierre Dognin | Kush R. Varshney
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach leverages scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents and shows improved performance against other approaches, as quantified through automatic metrics and win rates.
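The sketch below illustrates the general shape of such a pipeline: extract values from passages, then synthesize alignment examples from them. Both prompt templates and the `teacher` callable (a stronger generator model) are illustrative assumptions, not the paper's actual prompts or pipeline.

```python
# Sketch: unstructured text -> extracted values -> synthetic alignment data.
EXTRACT = "List the explicit and implicit values expressed in this passage:\n{passage}"
SYNTHESIZE = (
    "Write an instruction and the response that a model aligned to these "
    "values would give.\nValues: {values}"
)

def synthesize_alignment_data(teacher, passages):
    """Generate instruction/response-style training examples per passage."""
    examples = []
    for passage in passages:
        values = teacher(EXTRACT.format(passage=passage))
        examples.append(teacher(SYNTHESIZE.format(values=values)))
    return examples
```

The resulting synthetic examples would then feed standard supervised fine-tuning of a target model such as Mistral-7B-Instruct.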
Ranking Large Language Models without Ground Truth
Amit Dhurandhar | Rahul Nair | Moninder Singh | Elizabeth Daly | Karthikeyan Natesan Ramamurthy
Findings of the Association for Computational Linguistics: ACL 2024
Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover true rankings without reference data. This points to a viable low-resource mechanism for practical use.
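A minimal sketch of the triplet idea follows: each model in a triplet judges the other two, and the model judged worse most often is flagged as the triplet's worst. The `judge(evaluator, a, b)` callable, which returns whichever of `a` or `b` the evaluator prefers on the prompt set, is a hypothetical stand-in, and this simple vote is a simplification of the paper's full ranking procedures.

```python
# Sketch: identify the worst model in a triplet by peer evaluation.
from typing import Callable, List

def worst_in_triplet(models: List[str], judge: Callable[[str, str, str], str]) -> str:
    """Each model judges the other two; return the most-often-losing model."""
    assert len(models) == 3
    losses = {m: 0 for m in models}
    for evaluator in models:
        a, b = [m for m in models if m != evaluator]
        preferred = judge(evaluator, a, b)  # evaluator picks the better of a, b
        losses[b if preferred == a else a] += 1
    # With high probability, the truly worst model accumulates the most losses.
    return max(losses, key=losses.get)
```

Applying this elimination over many triplets drawn from the model set yields a full ranking without any reference responses.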
2022
Your fairness may vary: Pretrained language model fairness in toxic text classification
Ioana Baldini | Dennis Wei | Karthikeyan Natesan Ramamurthy | Moninder Singh | Mikhail Yurochkin
Findings of the Association for Computational Linguistics: ACL 2022
The popularity of pretrained language models in natural language processing systems calls for a careful evaluation of such models in downstream tasks, which have a higher potential for societal impact. The evaluation of such systems usually focuses on accuracy measures. Our findings in this paper call for attention to be paid to fairness measures as well. Through the analysis of more than a dozen pretrained language models of varying sizes on two toxic text classification tasks (English), we demonstrate that focusing on accuracy measures alone can lead to models with wide variation in fairness characteristics. Specifically, we observe that fairness can vary even more than accuracy with increasing training data size and different random initializations. At the same time, we find that little of the fairness variation is explained by model size, despite claims in the literature. To improve model fairness without retraining, we show that two post-processing methods developed for structured, tabular data can be successfully applied to a range of pretrained language models. Warning: This paper contains samples of offensive text.
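For flavor, here is a minimal sketch of one family of such post-processing fixes: choosing group-specific decision thresholds on a classifier's scores so that positive-prediction rates roughly match across groups. This illustrates the kind of tabular-data method the paper applies to language models; it is not the paper's exact algorithm, and the function below is an illustrative assumption.

```python
# Sketch: per-group thresholds that equalize positive-prediction rates,
# applied to classifier scores without retraining the underlying model.
import numpy as np

def equalize_rates(scores: np.ndarray, groups: np.ndarray, target_rate: float):
    """Pick a threshold per group giving roughly `target_rate` positives."""
    thresholds = {}
    for g in np.unique(groups):
        s = scores[groups == g]
        # The (1 - target_rate) quantile leaves ~target_rate of scores above it.
        thresholds[g] = np.quantile(s, 1.0 - target_rate)
    return thresholds
```

Predictions then compare each example's score against its group's threshold instead of a single global one, improving fairness measures while leaving the pretrained model untouched.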