RiddleBench: A New Generative Reasoning Benchmark for LLMs

Deepon Halder, Alan Saji, Thanmay Jayakumar, Anoop Kunchukuttan, Ratish Puduppully, Raj Dabre


Abstract
While Large Language Models (LLMs) show remarkable capabilities, their complex reasoning skills require deeper investigation. We introduce RiddleBench, a new benchmark of 1,737 challenging puzzles designed to test reasoning beyond simple pattern matching. Our evaluation of state-of-the-art models reveals significant limitations, including hallucination cascades (uncritically accepting flawed peer reasoning) and poor self-correction due to strong self-confirmation bias. We also find that model performance is fragile, degrading when faced with reordered constraints or irrelevant information. RiddleBench serves as a resource for diagnosing these issues and guiding the development of more robust LLMs.
Anthology ID:
2026.findings-eacl.228
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4363–4372
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.228/
Cite (ACL):
Deepon Halder, Alan Saji, Thanmay Jayakumar, Anoop Kunchukuttan, Ratish Puduppully, and Raj Dabre. 2026. RiddleBench: A New Generative Reasoning Benchmark for LLMs. In Findings of the Association for Computational Linguistics: EACL 2026, pages 4363–4372, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
RiddleBench: A New Generative Reasoning Benchmark for LLMs (Halder et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.228.pdf
Checklist:
2026.findings-eacl.228.checklist.pdf