From Wordle to Fibble5: Evaluating LLM Reasoning Under Escalating Deception

Chang Liu


Abstract
Standard benchmarks for large language models (LLMs) assume that task feedback is truthful, but real-world reasoning often requires processing unreliable or adversarial information. We introduce WordleArenas, a benchmark platform that evaluates LLM reasoning robustness across a deception gradient. Building on Wordle and its deceptive variant Fibble (Chusap et al., 2025), we generalize to Fibblek (k = 0, ..., 5 lies per row), creating a controlled evaluation of LLM robustness to misinformation. Across six arenas, from standard Wordle (0 lies per row) through Fibble5 (5 lies per row), we evaluate 41 models from 10 providers across 3,749 games. We find that (1) even one lie per row causes catastrophic performance drops (average win rate falls from 41.1% to 18.7%), (2) a sharp deception cliff emerges at 2–3 lies where nearly all models collapse to ≤3% win rate, and (3) model robustness to deception is poorly predicted by standard benchmark rankings. A surprising Fibble5 recovery emerges: some models recover partial performance when all feedback lies (average 9.5%), outperforming Fibble3 (0.3%) and Fibble4 (0.4%), because knowing that every tile lies restores deterministic, though partial, information. Our results demonstrate that truthful-feedback evaluations systematically overestimate LLM reasoning capabilities and that deception-aware benchmarks are essential for assessing real-world robustness. All code and data are publicly available.
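To make the "k lies per row" idea concrete, the following is a minimal sketch of how feedback with exactly k false tiles could be generated. It is not the WordleArenas implementation: the function names `true_feedback` and `fibble_feedback`, the color labels, and the rule that a lying tile shows one of the two wrong colors chosen uniformly at random are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical color labels; the benchmark's own encoding may differ.
COLORS = ("green", "yellow", "gray")


def true_feedback(guess, answer):
    """Standard Wordle scoring, including duplicate-letter handling."""
    feedback = ["gray"] * len(guess)
    # Answer letters that were not matched exactly can still earn a yellow.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "green"
    for i, g in enumerate(guess):
        if feedback[i] != "green" and remaining[g] > 0:
            feedback[i] = "yellow"
            remaining[g] -= 1
    return feedback


def fibble_feedback(guess, answer, k, rng=None):
    """Return feedback in which exactly k tiles show a wrong color.

    Assumption: lying positions are sampled without replacement, and each
    lying tile shows one of the two incorrect colors uniformly at random.
    """
    rng = rng or random.Random()
    truth = true_feedback(guess, answer)
    shown = list(truth)
    for i in rng.sample(range(len(truth)), k):
        shown[i] = rng.choice([c for c in COLORS if c != truth[i]])
    return shown
```

Under these assumptions, k = 0 reproduces standard Wordle, while k = 5 makes every displayed color false, so each tile still rules out one color and leaves two candidates. That is one way to read the abstract's remark that all-lie feedback restores deterministic, though partial, information.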
Anthology ID:
2026.evaleval-1.5
Volume:
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Month:
July
Year:
2026
Address:
San Diego, CA
Editors:
Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat, Michelle Lin
Venues:
EvalEval | WS
Publisher:
Association for Computational Linguistics
Pages:
36–45
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.5/
Cite (ACL):
Chang Liu. 2026. From Wordle to Fibble5: Evaluating LLM Reasoning Under Escalating Deception. In Proceedings of the Workshop on Evaluating Evaluations (EvalEval), pages 36–45, San Diego, CA. Association for Computational Linguistics.
Cite (Informal):
From Wordle to Fibble5: Evaluating LLM Reasoning Under Escalating Deception (Liu, EvalEval 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.5.pdf