@inproceedings{wang-etal-2025-evaluating,
title = "Evaluating {LLM}s with Multiple Problems at once",
author = "Wang, Zhengxiang and
Kodner, Jordan and
Rambow, Owen",
editor = "Arviv, Ofir and
Clinciu, Miruna and
Dhole, Kaustubh and
Dror, Rotem and
Gehrmann, Sebastian and
Habba, Eliya and
Itzhak, Itay and
Mille, Simon and
Perlitz, Yotam and
Santus, Enrico and
Sedoc, Jo{\~a}o and
Shmueli-Scheuer, Michal and
Stanovsky, Gabriel and
Tafjord, Oyvind",
booktitle = "Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM{\texttwosuperior})",
month = jul,
year = "2025",
address = "Vienna, Austria and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/metadata-correction-jian-chen-ub/2025.gem-1.14/",
pages = "178--199",
ISBN = "979-8-89176-261-9",
abstract = "This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all these problems in a single output. Leveraging 6 classification and 12 reasoning benchmarks that already exist, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs are capable of handling multiple problems from a single data source as well as handling them separately, but there are conditions this multiple problem handling capability falls short. In addition, we perform in-depth further analyses and explore model-level factors that may enable multiple problem handling capabilities in LLMs. We release our corpus and code to facilitate future research."
}