Generating Q&A Benchmarks for RAG Evaluation in Enterprise Settings

Simone Filice; Guy Horowitz; David Carmel; Zohar Karnin; Liane Lewin-Eytan; Yoelle Maarek

Abstract

We introduce DataMorgana, a tool for generating synthetic Q&A benchmarks tailored to RAG applications in enterprise settings. DataMorgana enables customization of the generated benchmark according to the expected diverse traffic of the RAG application. It allows for specifying question types and their associated distribution via a lightweight configuration mechanism. We demonstrate via a series of quantitative and qualitative experiments that DataMorgana surpasses existing tools in terms of lexical, syntactic, and semantic diversity of the generated benchmark while maintaining high quality. We run our experiments over domain-specific and general-knowledge public datasets, as well as two private datasets from governmental RAG applications: one for citizens and the other for government employees. The private datasets have been shared with us by AI71, an AI company, which has integrated DataMorgana into its offerings. In addition, DataMorgana has been offered to about 150 researchers worldwide as part of the SIGIR’2025 LiveRAG Challenge held in Spring 2025.
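The configuration idea described in the abstract (question types with an associated distribution) can be illustrated with a minimal sketch. This is not DataMorgana's actual configuration format or API, which this page does not specify. The category names, labels, probabilities, and function names below are all hypothetical, chosen only to show how a distribution over question types could drive the composition of a benchmark.

```python
import random

# Hypothetical configuration: each categorization maps question-type labels
# to the probability with which that type should appear in the benchmark.
# All names and numbers here are illustrative assumptions.
QUESTION_CONFIG = {
    "factuality": {"factoid": 0.6, "open-ended": 0.4},
    "phrasing": {"concise": 0.5, "verbose": 0.3, "search-query-like": 0.2},
}


def validate(config):
    """Check that every categorization is a proper probability distribution."""
    for name, dist in config.items():
        if any(p < 0 for p in dist.values()):
            raise ValueError(f"negative probability in categorization {name!r}")
        if abs(sum(dist.values()) - 1.0) > 1e-9:
            raise ValueError(f"probabilities in {name!r} do not sum to 1")


def sample_question_profile(config, rng):
    """Draw one label per categorization according to its distribution."""
    profile = {}
    for name, dist in config.items():
        labels = list(dist)
        weights = [dist[label] for label in labels]
        profile[name] = rng.choices(labels, weights=weights, k=1)[0]
    return profile


def sample_benchmark_plan(config, n, seed=0):
    """Return n question profiles; each would then guide one generated Q&A pair."""
    validate(config)
    rng = random.Random(seed)
    return [sample_question_profile(config, rng) for _ in range(n)]


if __name__ == "__main__":
    for profile in sample_benchmark_plan(QUESTION_CONFIG, 5, seed=42):
        print(profile)
```

In a sketch like this, each sampled profile would be passed to the question generator as an instruction about what kind of question to produce, so that the benchmark as a whole approximates the configured traffic mix.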

Anthology ID: 2025.acl-industry.33
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Georg Rehm, Yunyao Li
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 469–484
URL: https://preview.aclanthology.org/landing_page/2025.acl-industry.33/
Cite (ACL): Simone Filice, Guy Horowitz, David Carmel, Zohar Karnin, Liane Lewin-Eytan, and Yoelle Maarek. 2025. Generating Q&A Benchmarks for RAG Evaluation in Enterprise Settings. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 469–484, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Generating Q&A Benchmarks for RAG Evaluation in Enterprise Settings (Filice et al., ACL 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.acl-industry.33.pdf
