Exploring Limitations of LLM Capabilities with Multi-Problem Evaluation

Zhengxiang Wang, Jordan Kodner, Owen Rambow


Abstract
We propose evaluating LLM capabilities with prompts made up of multiple problems, an approach we call multi-problem evaluation. We examine 7 LLMs on 4 related task types constructed from 6 existing classification benchmarks. We find that LLMs can generally perform multiple homogeneous classifications in a single prompt (Batch Classification) about as well as they perform the same classifications separately. However, they perform significantly worse on two selection tasks that are conceptually equivalent to Batch Classification: selecting, either independently or all at once, the indices of the texts that fall under each class label. We show that this significant performance drop stems from LLMs' inability to adequately combine index selection with text classification. Surprisingly, the drop is observed across all LLMs tested, under zero-shot, few-shot, and CoT settings, and even on a novel synthetic dataset, potentially reflecting an inherent capability limitation of modern LLMs.
Anthology ID:
2025.insights-1.12
Volume:
The Sixth Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Aleksandr Drozd, João Sedoc, Shabnam Tafreshi, Arjun Akula, Raphael Shu
Venues:
insights | WS
Publisher:
Association for Computational Linguistics
Pages:
121–140
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.insights-1.12/
Cite (ACL):
Zhengxiang Wang, Jordan Kodner, and Owen Rambow. 2025. Exploring Limitations of LLM Capabilities with Multi-Problem Evaluation. In The Sixth Workshop on Insights from Negative Results in NLP, pages 121–140, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Exploring Limitations of LLM Capabilities with Multi-Problem Evaluation (Wang et al., insights 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.insights-1.12.pdf