Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist
Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, Jasmina Gajcin, Elizabeth M. Daly, Qian Pan, Michael Desmond
Abstract
We present a synthetic data generation tool integrated into EvalAssist. EvalAssist is a web-based application designed to assist human-centered evaluation of language model outputs by allowing users to refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows including circularity risks (where models are judged by criteria derived by themselves), compounded bias (amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.
- Anthology ID: 2025.emnlp-demos.1
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Ivan Habernal, Peter Schulam, Jörg Tiedemann
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 1–11
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.1/
- Cite (ACL): Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, Jasmina Gajcin, Elizabeth M. Daly, Qian Pan, and Michael Desmond. 2025. Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 1–11, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist (Santillán Cooper et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.1.pdf