2025
Granite Guardian: Comprehensive LLM Safeguarding
Inkit Padhi | Manish Nagireddy | Giandomenico Cornacchia | Subhajit Chaudhury | Tejaswini Pedapati | Pierre Dognin | Keerthiram Murugesan | Erik Miehling | Martín Santillán Cooper | Kieran Fraser | Giulio Zizzo | Muhammad Zaid Hameed | Mark Purcell | Michael Desmond | Qian Pan | Inge Vejsbjerg | Elizabeth M. Daly | Michael Hind | Werner Geyer | Ambrish Rawat | Kush R. Varshney | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
The deployment of language models in real-world applications exposes users to various risks, including hallucinations and harmful or unethical content. These challenges highlight the urgent need for robust safeguards to ensure safe and responsible AI. To address this, we introduce Granite Guardian, a suite of advanced models designed to detect and mitigate risks associated with prompts and responses, enabling seamless integration with any large language model (LLM). Unlike existing open-source solutions, our Granite Guardian models provide comprehensive coverage across a wide range of risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related issues such as context relevance, groundedness, and answer accuracy in retrieval-augmented generation (RAG) scenarios. Trained on a unique dataset combining diverse human annotations and synthetic data, Granite Guardian excels at identifying risks often overlooked by traditional detection systems, particularly jailbreak attempts and RAG-specific challenges. Code: https://github.com/ibm-granite/granite-guardian
DAMAGeR: Deploying Automatic and Manual Approaches to GenAI Red-teaming
Manish Nagireddy | Michael Feffer | Ioana Baldini
Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts)
In this tutorial, we will review and apply current automatic and manual red-teaming techniques for GenAI models (including LLMs and multimodal models). In doing so, we aim to emphasize the importance of using a mixture of techniques and establishing a balance between automatic and manual approaches. Lastly, we will engage tutorial participants in live red-teaming activities to collaboratively learn impactful red-teaming strategies and share insights.
2024
Value Alignment from Unstructured Text
Inkit Padhi | Karthikeyan Natesan Ramamurthy | Prasanna Sattigeri | Manish Nagireddy | Pierre Dognin | Kush R. Varshney
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach leverages scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches, as quantified through automatic metrics and win rates.
Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Erik Miehling | Manish Nagireddy | Prasanna Sattigeri | Elizabeth M. Daly | David Piorkowski | John T. Richards
Findings of the Association for Computational Linguistics: EMNLP 2024
Modern language models, while sophisticated, exhibit some inherent shortcomings, particularly in conversational settings. We claim that many of the observed shortcomings can be attributed to the violation of one or more conversational principles. Drawing upon extensive research from both the social science and AI communities, we propose a set of maxims – quantity, quality, relevance, manner, benevolence, and transparency – for describing effective human-AI conversation. We first justify the applicability of the first four maxims (from Grice) in the context of human-AI interactions. We then argue that two new maxims, benevolence (concerning the generation of, and engagement with, harmful content) and transparency (concerning recognition of one’s knowledge boundaries, operational constraints, and intents), are necessary for addressing behavior unique to modern human-AI interactions. We evaluate the degree to which various language models are able to understand these maxims and find that models possess an internal prioritization of principles that can significantly affect how accurately they interpret the maxims.