Guy Horowitz
2025
Generating Q&A Benchmarks for RAG Evaluation in Enterprise Settings
Simone Filice | Guy Horowitz | David Carmel | Zohar Karnin | Liane Lewin-Eytan | Yoelle Maarek
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
We introduce DataMorgana, a tool for generating synthetic Q&A benchmarks tailored to RAG applications in enterprise settings. DataMorgana enables customization of the generated benchmark according to the expected diverse traffic of the RAG application. It allows for specifying question types and their associated distribution via a lightweight configuration mechanism. We demonstrate via a series of quantitative and qualitative experiments that DataMorgana surpasses existing tools in terms of lexical, syntactic, and semantic diversity of the generated benchmark while maintaining high quality. We run our experiments over domain-specific and general-knowledge public datasets, as well as two private datasets from governmental RAG applications: one for citizens and the other for government employees. The private datasets have been shared with us by AI71, an AI company, which has integrated DataMorgana into its offerings. In addition, DataMorgana has been offered to about 150 researchers worldwide as part of the SIGIR’2025 LiveRAG Challenge held in Spring 2025.
2023
Consistent Text Categorization using Data Augmentation in e-Commerce
Noa Avigdor | Guy Horowitz | Ariel Raviv | Stav Yanovsky Daye
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
The categorization of massive e-Commerce data is a crucial, well-studied task, which is prevalent in industrial settings. In this work, we aim to improve an existing product categorization model that is already in use by a major web company, serving multiple applications. At its core, the product categorization model is a text classification model that takes a product title as an input and outputs the most suitable category out of thousands of available candidates. Upon closer inspection, we found inconsistencies in the labeling of similar items. For example, minor modifications of the product title pertaining to colors or measurements majorly impacted the model’s output. This phenomenon can negatively affect downstream recommendation or search applications, leading to a sub-optimal user experience. To address this issue, we propose a new framework for consistent text categorization. Our goal is to improve the model’s consistency while maintaining its production-level performance. We use a semi-supervised approach for data augmentation and present two different methods for utilizing unlabeled samples. One method relies directly on existing catalogs, while the other uses a generative model. We compare the pros and cons of each approach and present our experimental results.
Co-authors
- Noa Avigdor 1
- David Carmel 1
- Simone Filice 1
- Zohar Karnin 1
- Liane Lewin-Eytan 1
Venues
- acl 2