How Hard is Math? Using Quantitative Metrics to Measure LLM Alignment to Human Intuitions of Difficulty

Micah Helzerman, Steven R Wilson, Cam McLeman


Abstract
Modern LLMs have demonstrated advanced reasoning skills, including the ability to solve Olympiad-level mathematics problems. While solving increasingly difficult problems is a hallmark of LLM progress, less attention has been paid to how "difficulty" is operationalized in the context of LLM problem-solving tasks. This is particularly relevant in educational contexts where teachers or students may ask LLMs for "easy" or "hard" questions. In this paper, we explore various quantitative measurements from LLM-generated solutions and evaluate their inter-correlations, as well as their correlation with human-annotated difficulty scores. We find moderate correlations between metrics using log probabilities and output lengths, including some that are more strongly correlated with difficulty than LLM accuracy. We also train ModernBERT to predict difficulty scores, leading to reasonable accuracy within a given benchmark, but decreased performance when generalizing to other math benchmarks. Finally, to explore connections between difficulty scores and human performance, we collect problems, human solutions, and human performance data from the Putnam competition. We find poor alignment between LLM metrics and human-assigned difficulty scores, despite strong correlations between those scores and human performance on the problems.
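As a rough illustration of the kind of analysis the abstract describes, the sketch below correlates two per-solution metrics (mean token log probability and output length) with human difficulty scores. The abstract does not say which correlation coefficient or which exact metrics the authors use, so Spearman rank correlation, the function names, and the toy data here are all assumptions made for illustration only.

```python
import math

def mean_logprob(token_logprobs):
    """Average per-token log probability of one generated solution."""
    return sum(token_logprobs) / len(token_logprobs)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: token log probabilities for four generated
# solutions, and a made-up human difficulty score for each problem.
solutions = [
    [-0.1, -0.2, -0.1],
    [-0.5, -0.4, -0.6, -0.5],
    [-0.3, -0.2],
    [-0.9, -1.1, -1.0, -0.8, -1.2],
]
difficulty = [1, 3, 2, 5]

logprob_metric = [mean_logprob(s) for s in solutions]
length_metric = [len(s) for s in solutions]

rho_logprob = spearman(logprob_metric, difficulty)
rho_length = spearman(length_metric, difficulty)
print("logprob vs difficulty:", round(rho_logprob, 3))
print("length vs difficulty:", round(rho_length, 3))
```

In this toy data, lower mean log probability goes with higher difficulty, so the first coefficient is strongly negative, while output length is positively but imperfectly correlated with difficulty. A rank-based coefficient is used here only because difficulty labels are typically ordinal; it should not be read as the paper's choice.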
Anthology ID:
2026.acl-srw.85
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
968–981
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.85/
Cite (ACL):
Micah Helzerman, Steven R Wilson, and Cam McLeman. 2026. How Hard is Math? Using Quantitative Metrics to Measure LLM Alignment to Human Intuitions of Difficulty. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 968–981, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
How Hard is Math? Using Quantitative Metrics to Measure LLM Alignment to Human Intuitions of Difficulty (Helzerman et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.85.pdf