Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist

Martín Santillán Cooper; Zahra Ashktorab; Hyo Jin Do; Erik Miehling; Werner Geyer; Jasmina Gajcin; Elizabeth M. Daly; Qian Pan; Michael Desmond

Abstract

We present a synthetic data generation tool integrated into EvalAssist. EvalAssist is a web-based application designed to assist human-centered evaluation of language model outputs by allowing users to refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows including circularity risks (where models are judged by criteria derived by themselves), compounded bias (amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.

Anthology ID: 2025.emnlp-demos.1
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month: November
Year: 2025
Address: Suzhou, China
Editors: Ivan Habernal, Peter Schulam, Jörg Tiedemann
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1–11
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.1/
Cite (ACL): Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, Jasmina Gajcin, Elizabeth M. Daly, Qian Pan, and Michael Desmond. 2025. Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 1–11, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist (Santillán Cooper et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.1.pdf
