2025
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
Eliya Habba | Noam Dahan | Gili Lior | Gabriel Stanovsky
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of prompt variations. PromptSuite is flexible, working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. All resources, including the Python API, source code, user-friendly web interface, and demonstration video, are available at: https://eliyahabba.github.io/PromptSuite/.
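The modular design described in the abstract can be pictured with a short sketch. The following is an illustrative Python example, not the actual PromptSuite API: the components dict, its variant strings, and generate_prompts are hypothetical stand-ins for the framework's component/perturbation abstraction (see the project page for the real Python API).

```python
import itertools

# Illustrative sketch of modular multi-prompt generation (hypothetical
# names, not the PromptSuite API). Each prompt component carries a small
# set of controlled perturbations; the full prompt set is their product.
components = {
    "instruction": ["Answer the following question.",
                    "Choose the correct answer:"],
    "separator":   ["\n", " || "],
    "enumerator":  ["A. {a}\nB. {b}", "1) {a}\n2) {b}"],
}

def generate_prompts(question, a, b):
    """Yield one prompt per combination of component perturbations."""
    for instr, sep, enum in itertools.product(*components.values()):
        options = enum.format(a=a, b=b)
        yield f"{instr}{sep}{question}{sep}{options}"

for p in generate_prompts("Capital of France?", "Paris", "Rome"):
    print(p, end="\n---\n")  # 2 * 2 * 2 = 8 prompt variants
```

Extensibility in this picture amounts to adding a new key to the dict (a new component) or a new variant to an existing list (a new perturbation type).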
JSON Whisperer: Efficient JSON Editing with LLMs
Sarel Duanis | Asnat Greenstein-Messica | Eliya Habba
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) can modify JSON documents through natural language commands, but current approaches regenerate entire structures for each edit, resulting in computational inefficiency. We present JSON Whisperer, a framework that enables LLMs to generate RFC 6902 diff patches, expressing only the necessary modifications, rather than complete documents. We identify two key challenges in patch-based editing: (1) LLMs often miss related updates when generating isolated patches, and (2) array manipulations require tracking index shifts across operations, which LLMs handle poorly. To address these issues, we introduce EASE (Explicitly Addressed Sequence Encoding), which transforms arrays into dictionaries with stable keys, eliminating index arithmetic complexities. Our evaluation shows that patch generation with EASE reduces token usage by 31% while maintaining edit quality within 5% of full regeneration, with particular gains for complex instructions and list manipulations.
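The array problem and the EASE fix can be made concrete with a small sketch. This assumes one plausible key scheme (e0, e1, ... plus a sentinel marker for round-tripping); the actual encoding and helper names in JSON Whisperer may differ.

```python
ARRAY_MARK = "__ease_array__"  # sentinel: this dict was originally a list

def ease_encode(value):
    """Recursively replace lists with dicts keyed by stable ids e0, e1, ..."""
    if isinstance(value, list):
        enc = {f"e{i}": ease_encode(v) for i, v in enumerate(value)}
        enc[ARRAY_MARK] = True
        return enc
    if isinstance(value, dict):
        return {k: ease_encode(v) for k, v in value.items()}
    return value

def ease_decode(value):
    """Invert ease_encode, restoring lists in stable-key order."""
    if isinstance(value, dict):
        if value.get(ARRAY_MARK):
            items = [(k, v) for k, v in value.items() if k != ARRAY_MARK]
            items.sort(key=lambda kv: int(kv[0][1:]))  # order by e<number>
            return [ease_decode(v) for _, v in items]
        return {k: ease_decode(v) for k, v in value.items()}
    return value

doc = {"tasks": ["buy milk", "call mom", "pay rent"]}
enc = ease_encode(doc)
# A patch can now remove /tasks/e0 and /tasks/e2 independently; with raw
# RFC 6902 indices, removing /tasks/0 first shifts "pay rent" to /tasks/1,
# which is exactly the index arithmetic LLMs tend to get wrong.
del enc["tasks"]["e0"], enc["tasks"]["e2"]
print(ease_decode(enc))  # -> {'tasks': ['call mom']}
```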
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba | Ofir Arviv | Itay Itzhak | Yotam Perlitz | Elron Bandel | Leshem Choshen | Michal Shmueli-Scheuer | Gabriel Stanovsky
Findings of the Association for Computational Linguistics: ACL 2025
Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, yielding several findings: efficient methods for choosing well-performing prompts, evidence that few-shot examples reduce sensitivity, and instances that are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more at: https://slab-nlp.github.io/DOVE
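The "thousands of perturbations per instance" figure follows from a simple Cartesian product over dimensions. A back-of-the-envelope sketch, with dimension names and sizes that are purely hypothetical (not DOVE's actual schema):

```python
import itertools
from math import prod

# Hypothetical dimensions and sizes, chosen only to illustrate how joint
# variation multiplies into thousands of perturbations per instance.
dimensions = {
    "delimiter":      [": ", " - ", "\n"],
    "enumerator":     ["A.", "1.", "i.", "(a)"],
    "instruction":    [f"phrasing_{i}" for i in range(10)],
    "choice_order":   [f"perm_{i}" for i in range(24)],
    "few_shot_count": [0, 2, 4, 8],
}

print(prod(len(v) for v in dimensions.values()))  # 3*4*10*24*4 = 11520

# One prompt specification per combination:
first = next(itertools.product(*dimensions.values()))
print(first)  # (': ', 'A.', 'phrasing_0', 'perm_0', 0)
```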
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Ofir Arviv | Miruna Clinciu | Kaustubh Dhole | Rotem Dror | Sebastian Gehrmann | Eliya Habba | Itay Itzhak | Simon Mille | Yotam Perlitz | Enrico Santus | João Sedoc | Michal Shmueli-Scheuer | Gabriel Stanovsky | Oyvind Tafjord
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)