Human-Centered Design Recommendations for LLM-as-a-judge
Qian Pan, Zahra Ashktorab, Michael Desmond, Martín Santillán Cooper, James Johnson, Rahul Nair, Elizabeth Daly, Werner Geyer
Abstract
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure criteria used to evaluate are aligned with the human’s intent, and evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, that enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria aligning the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-judge systems.
- Anthology ID:
- 2024.hucllm-1.2
- Volume:
- Proceedings of the 1st Human-Centered Large Language Modeling Workshop
- Month:
- August
- Year:
- 2024
- Address:
- TBD
- Editors:
- Nikita Soni, Lucie Flek, Ashish Sharma, Diyi Yang, Sara Hooker, H. Andrew Schwartz
- Venues:
- HuCLLM | WS
- Publisher:
- ACL
- Pages:
- 16–29
- URL:
- https://aclanthology.org/2024.hucllm-1.2
- Cite (ACL):
- Qian Pan, Zahra Ashktorab, Michael Desmond, Martín Santillán Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer. 2024. Human-Centered Design Recommendations for LLM-as-a-judge. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop, pages 16–29, TBD. ACL.
- Cite (Informal):
- Human-Centered Design Recommendations for LLM-as-a-judge (Pan et al., HuCLLM-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.hucllm-1.2.pdf