2025
Evaluating the Prompt Steerability of Large Language Models
Erik Miehling | Michael Desmond | Karthikeyan Natesan Ramamurthy | Elizabeth M. Daly | Kush R. Varshney | Eitan Farchi | Pierre Dognin | Jesus Rios | Djallel Bouneffouf | Miao Liu | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model’s joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited — due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
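As a rough illustration of the idea behind the benchmark (not its actual metric), the sketch below measures how far a steered behavioral distribution moves away from the model's baseline as steering effort increases; the function names, the use of total-variation distance, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def distribution_shift(baseline: np.ndarray, steered: np.ndarray) -> float:
    """Total-variation distance between two behavioral distributions."""
    baseline = baseline / baseline.sum()
    steered = steered / steered.sum()
    return 0.5 * float(np.abs(baseline - steered).sum())

def steerability_profile(baseline: np.ndarray, steered_by_effort: list) -> list:
    """Shift achieved at each level of steering effort (e.g. number of persona statements)."""
    return [distribution_shift(baseline, s) for s in steered_by_effort]

# Toy example: a model whose baseline already skews toward one end of a persona
# dimension, steered with an increasing number of persona-conditioning statements.
baseline = np.array([0.7, 0.2, 0.1])
steered = [np.array([0.60, 0.25, 0.15]),
           np.array([0.45, 0.30, 0.25]),
           np.array([0.35, 0.30, 0.35])]
print(steerability_profile(baseline, steered))  # a flat profile suggests limited steerability
```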
Granite Guardian: Comprehensive LLM Safeguarding
Inkit Padhi | Manish Nagireddy | Giandomenico Cornacchia | Subhajit Chaudhury | Tejaswini Pedapati | Pierre Dognin | Keerthiram Murugesan | Erik Miehling | Martín Santillán Cooper | Kieran Fraser | Giulio Zizzo | Muhammad Zaid Hameed | Mark Purcell | Michael Desmond | Qian Pan | Inge Vejsbjerg | Elizabeth M. Daly | Michael Hind | Werner Geyer | Ambrish Rawat | Kush R. Varshney | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
The deployment of language models in real-world applications exposes users to various risks, including hallucinations and harmful or unethical content. These challenges highlight the urgent need for robust safeguards to ensure safe and responsible AI. To address this, we introduce Granite Guardian, a suite of advanced models designed to detect and mitigate risks associated with prompts and responses, enabling seamless integration with any large language model (LLM). Unlike existing open-source solutions, our Granite Guardian models provide comprehensive coverage across a wide range of risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related issues such as context relevance, groundedness, and answer accuracy in retrieval-augmented generation (RAG) scenarios. Trained on a unique dataset combining diverse human annotations and synthetic data, Granite Guardian excels in identifying risks often overlooked by traditional detection systems, particularly jailbreak attempts and RAG-specific challenges. https://github.com/ibm-granite/granite-guardian
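A minimal sketch of how a guard model of this kind might screen traffic to and from another LLM, written against standard Hugging Face transformers calls; the model identifier, the instruction format, and the yes/no parsing below are placeholders, since the released checkpoints, supported risk definitions, and chat-template usage are documented in the linked repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifier; see the linked repository for the released checkpoints
# and their supported risk definitions and prompt conventions.
GUARD_MODEL = "ibm-granite/granite-guardian-3.0-2b"

tokenizer = AutoTokenizer.from_pretrained(GUARD_MODEL)
model = AutoModelForCausalLM.from_pretrained(GUARD_MODEL, torch_dtype=torch.bfloat16)

def is_risky(message: str, risk: str = "harm") -> bool:
    """Ask the guard model whether a message carries the named risk (simplified prompt format)."""
    query = f"Risk to check: {risk}\nMessage: {message}\nAnswer Yes or No."
    inputs = tokenizer(query, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return answer.strip().lower().startswith("yes")

# Typical wiring: screen the user prompt before it reaches the main LLM,
# and screen the main LLM's response before it reaches the user.
```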
2024
Human-Centered Design Recommendations for LLM-as-a-judge
Qian Pan | Zahra Ashktorab | Michael Desmond | Martín Santillán Cooper | James Johnson | Rahul Nair | Elizabeth Daly | Werner Geyer
Proceedings of the 1st Human-Centered Large Language Modeling Workshop
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure that the evaluation criteria are aligned with the human’s intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, which enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations for optimizing human-assisted LLM-as-a-judge systems.
2020
Label Noise in Context
Michael Desmond | Catherine Finegan-Dollak | Jeff Boston | Matt Arnold
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Label noise—incorrectly or ambiguously labeled training examples—can negatively impact model performance. Although noise detection techniques have been around for decades, practitioners rarely apply them, as manual noise remediation is a tedious process. Examples incorrectly flagged as noise waste reviewers’ time, and correcting label noise without guidance can be difficult. We propose LNIC, a noise-detection method that uses an example’s neighborhood within the training set to (a) reduce false positives and (b) provide an explanation as to why the example was flagged as noise. We demonstrate on several short-text classification datasets that LNIC outperforms the state of the art on measures of precision and F0.5-score. We also show how LNIC’s training set context helps a reviewer to understand and correct label noise in a dataset. The LNIC tool lowers the barriers to label noise remediation, increasing its utility for NLP practitioners.
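The core idea lends itself to a compact sketch: flag a training example when its label disagrees with the majority label among its nearest neighbors, and show those neighbors to the reviewer as the rationale. The code below is a generic neighborhood heuristic in the spirit of the abstract, not LNIC's actual scoring; the function name and the choice of k are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_label_noise(X: np.ndarray, y: np.ndarray, k: int = 5):
    """Flag examples whose label disagrees with the majority label of their k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is returned as its own neighbor
    _, idx = nn.kneighbors(X)
    flagged = []
    for i, neighbors in enumerate(idx):
        neighbors = neighbors[1:]                    # drop the point itself
        labels, counts = np.unique(y[neighbors], return_counts=True)
        majority = labels[np.argmax(counts)]
        if majority != y[i]:
            # The neighbor indices double as the explanation shown to a reviewer.
            flagged.append((i, neighbors.tolist(), majority))
    return flagged
```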