U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models
Konstantin Chernyshev | Vitaliy Polshkov | Vlad Stepanov | Alex Myasnikov | Ekaterina Artemova | Alexei Miasnikov | Sergei Tilga
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Current evaluations of mathematical skills in Large Language Models are constrained by benchmarks that lack scope, particularly for multi-modal problems, frequently relying on school-level, niche Olympiad-style, simple quiz-format, or relatively small datasets. To address this, we introduce **U-MATH**, a novel benchmark comprising **1,100** unpublished open-ended university-level problems sourced from current US curricula, with **20%** incorporating visual elements. Given the free-form nature of U-MATH problems, we employ LLM judges for solution evaluation and release **𝜇-MATH**, a meta-evaluation benchmark composed of **1,084** U-MATH-derived tasks enabling precise assessment of these judges. Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning, with maximum accuracy reaching 93.1% on textual tasks but only 58.5% on visual ones. Furthermore, solution judgment proves challenging: only the most advanced models achieve meaningfully high performance, and even they peak at an imperfect F1-score of 90.1%.