Identifying Where Large Language Models Struggle in Answering Complex Questions
Xanh Ho, Florian Boudin, Saku Sugawara, Khoa Duong, Akiko Aizawa
Abstract
We design experiments to identify where Large Language Models (LLMs) struggle when answering complex questions. Our focus is on two key stages, mirroring the human QA process: 1) question decomposition, where the model breaks down a complex question into sub-questions and 2) subproblem solving, where it addresses each sub-question to obtain the final response. We preprocess and expand three multi-hop datasets to create experimental datasets featuring explicit and implicit multi-hop questions, crowdsourced and templated questions, and varying numbers of hops. Our results show that larger models (Llama 3.1 70B and o1) excel at decomposing explicit multi-hop questions but struggle with implicit ones, while smaller models (e.g., Llama 3.1 8B) have difficulty with both. In the sub-problem solving stage, all models perform well on simple questions with context. Furthermore, we found no correlation between accuracy in the question decomposition stage and final QA performance (direct response), highlighting a key difference between human and LLM reasoning.
- Anthology ID:
- 2026.gem-main.11
- Volume:
- Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
- Venues:
- GEM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 112–123
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.11/
- Cite (ACL):
- Xanh Ho, Florian Boudin, Saku Sugawara, Khoa Duong, and Akiko Aizawa. 2026. Identifying Where Large Language Models Struggle in Answering Complex Questions. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 112–123, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Identifying Where Large Language Models Struggle in Answering Complex Questions (Ho et al., GEM 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.11.pdf