Evaluating LLMs with Multiple Problems at once

Zhengxiang Wang, Jordan Kodner, Owen Rambow

Abstract
This paper demonstrates the benefits of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all of them in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We evaluate a total of 13 LLMs from 5 model families on ZeMPE to provide a comprehensive and systematic multi-problem evaluation. Our results show that LLMs can handle multiple problems from a single data source as well as they handle them separately, but there are conditions under which this capability falls short. In addition, we perform further in-depth analyses and explore model-level factors that may enable multi-problem handling in LLMs. We release our corpus and code to facilitate future research.
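To make the MPE setup concrete, below is a minimal sketch of how a zero-shot multi-problem prompt could be assembled from several single-problem instances. The function name, prompt template, instruction wording, and example problems are illustrative assumptions, not the authors' actual ZeMPE templates.

    # Minimal sketch of a zero-shot multi-problem prompt in the spirit of MPE:
    # several problems from one data source are packed into a single prompt,
    # and the model is asked to answer all of them in one output. Everything
    # below is an illustrative assumption, not the paper's ZeMPE template.

    def build_multi_problem_prompt(instruction: str, problems: list[str]) -> str:
        """Number each problem and prepend a shared instruction."""
        numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(problems, start=1))
        return f"{instruction}\n\n{numbered}"

    # Example usage with two toy classification problems.
    problems = [
        'Classify the sentiment of: "The movie was a delight."',
        'Classify the sentiment of: "I regretted every minute."',
    ]
    instruction = ("Answer each problem below. Give one answer per line, "
                   "numbered to match the problems.")
    print(build_multi_problem_prompt(instruction, problems))

In such a setup, the model's numbered answers can be aligned with the original labels, allowing per-problem scoring that is directly comparable to single-problem evaluation on the same items.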
Anthology ID:
2025.gem-1.14
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
178–199
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.14/
Cite (ACL):
Zhengxiang Wang, Jordan Kodner, and Owen Rambow. 2025. Evaluating LLMs with Multiple Problems at once. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 178–199, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Evaluating LLMs with Multiple Problems at once (Wang et al., GEM 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.14.pdf