Human-Centered Design Recommendations for LLM-as-a-judge
Qian Pan | Zahra Ashktorab | Michael Desmond | Martín Santillán Cooper | James Johnson | Rahul Nair | Elizabeth Daly | Werner Geyer
Proceedings of the 1st Human-Centered Large Language Modeling Workshop, 2024
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain significant concerns. Integrating human input is crucial to ensure that evaluation criteria are aligned with human intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration, EvaluLLM, which enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified a need for assistance in developing effective evaluation criteria and in aligning the LLM-as-a-judge with practitioners' preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-a-judge systems.
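For readers unfamiliar with the setup, the sketch below illustrates the general LLM-as-a-judge pattern the abstract refers to: a user-authored criterion is inserted into a judge prompt that compares two candidate outputs. This is not EvaluLLM's implementation; the function `call_llm`, the prompt wording, and the example criterion are placeholder assumptions for illustration only.

```python
# Minimal sketch of pairwise LLM-as-a-judge evaluation with a user-defined criterion.
# `call_llm` is a placeholder for whatever model API is available; it is not part of
# EvaluLLM, and the prompt wording here is assumed for illustration.

JUDGE_PROMPT = """You are an evaluator. Criterion: {criterion}

Response A:
{response_a}

Response B:
{response_b}

Which response better satisfies the criterion? Answer with exactly "A" or "B",
then give a one-sentence justification."""


def judge_pair(call_llm, criterion: str, response_a: str, response_b: str) -> str:
    """Ask the judge model to compare two candidate outputs against one criterion."""
    prompt = JUDGE_PROMPT.format(
        criterion=criterion, response_a=response_a, response_b=response_b
    )
    return call_llm(prompt)


# Example of a criterion a practitioner might iteratively refine, as the study's
# participants did when aligning the judge with their expectations.
example_criterion = (
    "The summary is faithful to the source document and omits no key facts."
)
```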