Zahra Ashktorab


2025

Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist
Martín Santillán Cooper | Zahra Ashktorab | Hyo Jin Do | Erik Miehling | Werner Geyer | Jasmina Gajcin | Elizabeth M. Daly | Qian Pan | Michael Desmond
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a synthetic data generation tool integrated into EvalAssist. EvalAssist is a web-based application designed to support human-centered evaluation of language model outputs by allowing users to refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows, including circularity risks (where models are judged by criteria that they themselves helped derive), compounded bias (amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.
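
The abstract's mention of persona diversity, grounding, and iterative generation suggests a simple pattern for producing evaluation test cases. The sketch below is illustrative only and is not EvalAssist's actual API: the generic `llm` callable, the persona list, and the prompt template are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: any text-in, text-out completion function can be plugged in.
LLMFn = Callable[[str], str]

@dataclass
class SyntheticExample:
    persona: str
    prompt: str
    response: str

# Example personas chosen to diversify generated test cases (assumed, not from the paper).
PERSONAS = [
    "a first-time user asking a vague question",
    "a domain expert probing an edge case",
    "an adversarial user trying to elicit biased output",
]

GEN_TEMPLATE = (
    "You are generating test data for evaluating a language model.\n"
    "Write one realistic user request from the perspective of {persona}, "
    "grounded in the following reference material:\n{grounding}\n"
    "Then write a plausible model response to that request."
)

def generate_examples(llm: LLMFn, grounding: str, n_per_persona: int = 2) -> List[SyntheticExample]:
    """Generate persona-diverse, grounded (request, response) pairs for later judging."""
    examples = []
    for persona in PERSONAS:
        for _ in range(n_per_persona):
            prompt = GEN_TEMPLATE.format(persona=persona, grounding=grounding)
            output = llm(prompt)
            examples.append(SyntheticExample(persona=persona, prompt=prompt, response=output))
    return examples
```

In an iterative workflow of the kind the abstract describes, a practitioner would inspect the generated examples, adjust the personas or grounding material, and regenerate until the set covers the edge cases they care about.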

2024

Human-Centered Design Recommendations for LLM-as-a-judge
Qian Pan | Zahra Ashktorab | Michael Desmond | Martín Santillán Cooper | James Johnson | Rahul Nair | Elizabeth Daly | Werner Geyer
Proceedings of the 1st Human-Centered Large Language Modeling Workshop

Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain significant concerns. Integrating human input is crucial to ensure that the criteria used to evaluate are aligned with the human’s intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM that enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria that align the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-a-judge systems.
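
To make the LLM-as-a-judge setup discussed here concrete, the following is a minimal sketch of judging a model output against user-defined criteria. It is not EvaluLLM's implementation: the `Criterion` dataclass, the judge prompt, and the generic `llm` callable are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical LLM interface: any text-in, text-out completion function can be plugged in.
LLMFn = Callable[[str], str]

@dataclass
class Criterion:
    name: str
    description: str
    options: List[str]  # allowed verdicts, e.g. ["yes", "no"]

JUDGE_TEMPLATE = (
    "You are evaluating a model output against a user-defined criterion.\n"
    "Criterion: {name} - {description}\n"
    "Output to evaluate:\n{output}\n"
    "Answer with exactly one of: {options}."
)

def judge(llm: LLMFn, output: str, criteria: List[Criterion]) -> Dict[str, str]:
    """Score a single model output against each criterion; returns criterion name -> verdict."""
    verdicts = {}
    for c in criteria:
        prompt = JUDGE_TEMPLATE.format(
            name=c.name,
            description=c.description,
            output=output,
            options=", ".join(c.options),
        )
        raw = llm(prompt).strip().lower()
        # Accept only a valid option; flag anything else for human review,
        # reflecting the human-in-the-loop emphasis of the paper.
        verdicts[c.name] = raw if raw in c.options else "needs-review"
    return verdicts
```

Keeping the criteria as explicit, editable objects is one way to support the kind of human refinement the study's participants asked for: practitioners can revise a criterion's description or options and re-run the judge without touching the rest of the pipeline.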