Too long; didn’t solve

Lucía Cabrera; Jocelyn D’Arcy; Isaac Saxton-Knight

Abstract

Mathematical benchmarks are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. Across five evaluated models, we find that both prompt length and solution length are positively associated with model failure. These associations are statistically significant but modest, and we interpret them as descriptive rather than causal. We also include a secondary, exploratory analysis of cross-model disagreement. Because disagreement measures based on variance are mechanically constrained by mean failure, we treat this part of the analysis cautiously. Overall, our main finding is that structural length is linked to empirical difficulty in this benchmark, suggesting that length should be considered a potential confounder when interpreting model evaluations on mathematical benchmarks.
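The remark that variance-based disagreement is mechanically constrained by mean failure can be illustrated with a minimal sketch. It assumes, hypothetically, that each model's outcome on a problem is a binary failure indicator and that disagreement is the population variance of those indicators across the five models; the abstract does not specify the paper's actual measure, and the function name `disagreement` is ours. Under that assumption, the variance equals p(1 - p), where p is the mean failure rate, so it carries no information beyond the mean.

```python
from statistics import pvariance

N_MODELS = 5  # the abstract reports five evaluated models


def disagreement(outcomes):
    """Population variance of binary failure indicators (1 = failed).

    Hypothetical disagreement measure, used only for illustration.
    """
    return pvariance(outcomes)


# Enumerate every possible number of failing models on a single problem.
rows = []
for n_failed in range(N_MODELS + 1):
    outcomes = [1] * n_failed + [0] * (N_MODELS - n_failed)
    p = n_failed / N_MODELS  # mean failure rate on this problem
    rows.append((p, disagreement(outcomes), p * (1 - p)))

for p, var, closed_form in rows:
    print(f"mean failure {p:.1f}: variance {var:.2f}, p(1-p) {closed_form:.2f}")
```

In this setting the variance is zero when all models agree (p = 0 or p = 1) and largest when failures are closest to an even split, so any correlation between such a disagreement score and mean failure partly reflects the formula rather than the data.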

Anthology ID: 2026.evaleval-1.20
Volume: Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Month: July
Year: 2026
Address: San Diego, CA
Editors: Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat, Michelle Lin
Venues: EvalEval | WS
Publisher: Association for Computational Linguistics
Pages: 100–110
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.20/
Cite (ACL): Lucía Cabrera, Jocelyn D’Arcy, and Isaac Saxton-Knight. 2026. Too long; didn’t solve. In Proceedings of the Workshop on Evaluating Evaluations (EvalEval), pages 100–110, San Diego, CA. Association for Computational Linguistics.
Cite (Informal): Too long; didn’t solve (Cabrera et al., EvalEval 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.20.pdf
