Abstract
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, which is reflected in their performance on the current test tasks. This calls for a more challenging benchmark requiring highly advanced reasoning ability to be solved. In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. Puzzles are sourced from the “5 Minute Mystery” platform and include a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve over 80% success rate. We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles. This indicates that there is still a significant gap in the deep reasoning abilities of LLMs and humans and highlights the need for further research in this area. Our work introduces a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs’ abilities.
- Anthology ID:
- 2023.starsem-1.28
- Volume:
- Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Alexis Palmer, Jose Camacho-Collados
- Venue:
- *SEM
- SIG:
- SIGLEX
- Publisher:
- Association for Computational Linguistics
- Pages:
- 314–322
- URL:
- https://aclanthology.org/2023.starsem-1.28
- DOI:
- 10.18653/v1/2023.starsem-1.28
- Cite (ACL):
- Maksym Del and Mark Fishel. 2023. True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 314–322, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 (Del & Fishel, *SEM 2023)
- PDF:
- https://preview.aclanthology.org/landing_page/2023.starsem-1.28.pdf