RiddleBench: A New Generative Reasoning Benchmark for LLMs
Deepon Halder, Alan Saji, Thanmay Jayakumar, Anoop Kunchukuttan, Ratish Puduppully, Raj Dabre
Abstract
While Large Language Models (LLMs) show remarkable capabilities, their complex reasoning skills require deeper investigation. We introduce **RiddleBench**, a new benchmark of 1,737 challenging puzzles designed to test reasoning beyond simple pattern matching. Our evaluation of state-of-the-art models reveals significant limitations, including hallucination cascades (uncritically accepting flawed peer reasoning) and poor self-correction due to strong self-confirmation bias. We also find that model performance is fragile, degrading when faced with reordered constraints or irrelevant information. RiddleBench serves as a resource for diagnosing these issues and guiding the development of more robust LLMs.
- Anthology ID:
- 2026.findings-eacl.228
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4363–4372
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.228/
- Cite (ACL):
- Deepon Halder, Alan Saji, Thanmay Jayakumar, Anoop Kunchukuttan, Ratish Puduppully, and Raj Dabre. 2026. RiddleBench: A New Generative Reasoning Benchmark for LLMs. In Findings of the Association for Computational Linguistics: EACL 2026, pages 4363–4372, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- RiddleBench: A New Generative Reasoning Benchmark for LLMs (Halder et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.228.pdf