Jesus Rios
2025
Evaluating the Prompt Steerability of Large Language Models
Erik Miehling | Michael Desmond | Karthikeyan Natesan Ramamurthy | Elizabeth M. Daly | Kush R. Varshney | Eitan Farchi | Pierre Dognin | Jesus Rios | Djallel Bouneffouf | Miao Liu | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model’s joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited — due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
Co-authors
- Djallel Bouneffouf 1
- Elizabeth M. Daly 1
- Michael Desmond 1
- Pierre Dognin 1
- Eitan Farchi 1