U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models
Konstantin Chernyshev, Vitaliy Polshkov, Vlad Stepanov, Alex Myasnikov, Ekaterina Artemova, Alexei Miasnikov, Sergei Tilga
Abstract
Current evaluations of mathematical skills in Large Language Models are constrained by benchmarks lacking scope, particularly for multi-modal problems, frequently relying on school-level, niche Olympiad-style, simple quiz-format, or relatively small datasets. To address this, we introduce **U-MATH**, a novel benchmark comprising **1,100** unpublished open-ended university-level problems sourced from current US curricula, **20%** of which incorporate visual elements. Given the free-form nature of U-MATH problems, we employ LLM judges for solution evaluation and release **μ-MATH**, a meta-evaluation benchmark composed of **1,084** U-MATH-derived tasks that enables precise assessment of these judges. Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning: maximum accuracy reaches 93.1% on textual tasks but only 58.5% on visual ones. Furthermore, solution judgment proves challenging: only the most advanced models achieve meaningfully high performance, and even they peak at an imperfect F1-score of 90.1%.
- Anthology ID:
- 2025.gem-1.77
- Volume:
- Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria and virtual meeting
- Editors:
- Kaustubh Dhole, Miruna Clinciu
- Venues:
- GEM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 974–1001
- URL:
- https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.77/
- Cite (ACL):
- Konstantin Chernyshev, Vitaliy Polshkov, Vlad Stepanov, Alex Myasnikov, Ekaterina Artemova, Alexei Miasnikov, and Sergei Tilga. 2025. U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 974–1001, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
- Cite (Informal):
- U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models (Chernyshev et al., GEM 2025)
- PDF:
- https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.77.pdf