Kush R. Varshney
2026
AI Steerability 360: A Toolkit for Steering Large Language Models
Erik Miehling | Karthikeyan Natesan Ramamurthy | Praveen Venkateswaran | Ching-Yun Ko | Pierre Dognin | Moninder Singh | Tejaswini Pedapati | Avinash Balakrishnan | Matthew Riemer | Dennis Wei | Inge Vejsbjerg | Elizabeth M. Daly | Kush R. Varshney
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
The AI Steerability 360 toolkit is an extensible, open-source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model’s weights or architecture), state (modification of the model’s activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at https://github.com/IBM/AISteer360.
2025
Evaluating the Prompt Steerability of Large Language Models
Erik Miehling | Michael Desmond | Karthikeyan Natesan Ramamurthy | Elizabeth M. Daly | Kush R. Varshney | Eitan Farchi | Pierre Dognin | Jesus Rios | Djallel Bouneffouf | Miao Liu | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model’s joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited — due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
Granite Guardian: Comprehensive LLM Safeguarding
Inkit Padhi | Manish Nagireddy | Giandomenico Cornacchia | Subhajit Chaudhury | Tejaswini Pedapati | Pierre Dognin | Keerthiram Murugesan | Erik Miehling | Martín Santillán Cooper | Kieran Fraser | Giulio Zizzo | Muhammad Zaid Hameed | Mark Purcell | Michael Desmond | Qian Pan | Inge Vejsbjerg | Elizabeth M. Daly | Michael Hind | Werner Geyer | Ambrish Rawat | Kush R. Varshney | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
The deployment of language models in real-world applications exposes users to various risks, including hallucinations and harmful or unethical content. These challenges highlight the urgent need for robust safeguards to ensure safe and responsible AI. To address this, we introduce Granite Guardian, a suite of advanced models designed to detect and mitigate risks associated with prompts and responses, enabling seamless integration with any large language model (LLM). Unlike existing open-source solutions, our Granite Guardian models provide comprehensive coverage across a wide range of risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related issues such as context relevance, groundedness, and answer accuracy in retrieval-augmented generation (RAG) scenarios. Trained on a unique dataset combining diverse human annotations and synthetic data, Granite Guardian excels in identifying risks often overlooked by traditional detection systems, particularly jailbreak attempts and RAG-specific challenges. https://github.com/ibm-granite/granite-guardian
2024
Value Alignment from Unstructured Text
Inkit Padhi | Karthikeyan Natesan Ramamurthy | Prasanna Sattigeri | Manish Nagireddy | Pierre Dognin | Kush R. Varshney
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach uses scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents and shows improved performance against other approaches, as quantified by automatic metrics and win rates.
Co-authors
- Pierre Dognin 4
- Elizabeth M. Daly 3
- Erik Miehling 3
- Karthikeyan Natesan Ramamurthy 3
- Prasanna Sattigeri 3
- Michael Desmond 2
- Manish Nagireddy 2
- Inkit Padhi 2
- Tejaswini Pedapati 2
- Inge Vejsbjerg 2
- Avinash Balakrishnan 1
- Djallel Bouneffouf 1
- Subhajit Chaudhury 1
- Giandomenico Cornacchia 1
- Eitan Farchi 1
- Kieran Fraser 1
- Werner Geyer 1
- Muhammad Zaid Hameed 1
- Michael Hind 1
- Ching-Yun Ko 1
- Miao Liu 1
- Keerthiram Murugesan 1
- Qian Pan 1
- Mark Purcell 1
- Ambrish Rawat 1
- Matthew Riemer 1
- Jesus Rios 1
- Martín Santillán Cooper 1
- Moninder Singh 1
- Praveen Venkateswaran 1
- Dennis Wei 1
- Giulio Zizzo 1