ARC ‘Challenge’ Is Not That Challenging

Łukasz Borchmann


Abstract
ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily because of an evaluation setup that prevents direct comparison of answer choices, not because of inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
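
To make the contrast in setups concrete, here is a minimal Python sketch of the two evaluation schemes the abstract refers to, assuming only a generic log-probability interface. All names here (LogProbFn, separate_scoring, joint_scoring) are illustrative, not taken from the paper's code.

    from string import ascii_uppercase
    from typing import Callable, Sequence

    # Assumed interface (an assumption, not the paper's API): returns the
    # model's total log-probability of `continuation` given `prompt`. Any LLM
    # API that exposes token log-probs can be wrapped to provide this.
    LogProbFn = Callable[[str, str], float]

    def separate_scoring(question: str, choices: Sequence[str],
                         logprob: LogProbFn) -> int:
        """Traditional setup: each choice is scored in isolation, so the
        model never sees the alternatives it is competing against."""
        # Per-character length normalization so longer answers are not penalized.
        scores = [logprob(question, " " + c) / len(c) for c in choices]
        return max(range(len(choices)), key=scores.__getitem__)

    def joint_scoring(question: str, choices: Sequence[str],
                      logprob: LogProbFn) -> int:
        """Fairer setup: all options appear in a single prompt, and the model
        only has to prefer the letter of the best one."""
        listing = "\n".join(f"{ascii_uppercase[i]}. {c}"
                            for i, c in enumerate(choices))
        prompt = f"{question}\n{listing}\nAnswer:"
        scores = [logprob(prompt, f" {ascii_uppercase[i]}")
                  for i in range(len(choices))]
        return max(range(len(choices)), key=scores.__getitem__)

Under the first scheme, a distractor that is merely plausible on its own can outscore the correct answer; the second lets the model weigh the choices against one another, which is the comparison the benchmark is meant to test.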
Anthology ID:
2025.findings-acl.144
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2797–2804
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.144/
Cite (ACL):
Łukasz Borchmann. 2025. ARC ‘Challenge’ Is Not That Challenging. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2797–2804, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
ARC ‘Challenge’ Is Not That Challenging (Borchmann, Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.144.pdf