Human-Centered Design Recommendations for LLM-as-a-judge

Qian Pan; Zahra Ashktorab; Michael Desmond; Martín Santillán Cooper; James Johnson; Rahul Nair; Elizabeth Daly; Werner Geyer

Abstract

Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain significant concerns. Integrating human input is crucial to ensure that the evaluation criteria align with the human’s intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM that enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-a-judge systems.

Anthology ID: 2024.hucllm-1.2
Volume: Proceedings of the 1st Human-Centered Large Language Modeling Workshop
Month: August
Year: 2024
Address: TBD
Editors: Nikita Soni, Lucie Flek, Ashish Sharma, Diyi Yang, Sara Hooker, H. Andrew Schwartz
Venues: HuCLLM | WS
Publisher: ACL
Pages: 16–29
URL: https://aclanthology.org/2024.hucllm-1.2
Cite (ACL): Qian Pan, Zahra Ashktorab, Michael Desmond, Martín Santillán Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer. 2024. Human-Centered Design Recommendations for LLM-as-a-judge. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop, pages 16–29, TBD. ACL.
Cite (Informal): Human-Centered Design Recommendations for LLM-as-a-judge (Pan et al., HuCLLM-WS 2024)
PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.hucllm-1.2.pdf
