Vlad Stepanov


2025

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models
Konstantin Chernyshev | Vitaliy Polshkov | Vlad Stepanov | Alex Myasnikov | Ekaterina Artemova | Alexei Miasnikov | Sergei Tilga
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Current evaluations of mathematical skills in Large Language Models are constrained by benchmarks of limited scope, particularly for multi-modal problems: they frequently rely on school-level, niche Olympiad-style, simple quiz-format, or relatively small datasets. To address this, we introduce **U-MATH**, a novel benchmark comprising **1,100** unpublished open-ended university-level problems sourced from current US curricula, with **20%** incorporating visual elements. Given the free-form nature of U-MATH problems, we employ LLM judges for solution evaluation and release **𝜇-MATH**, a meta-evaluation benchmark composed of **1,084** U-MATH-derived tasks enabling precise assessment of these judges. Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning, with maximum accuracy reaching 93.1% on textual tasks but only 58.5% on visual ones. Furthermore, solution judgment proves challenging: only the most advanced models achieve meaningfully high performance, and even they peak at an imperfect F1-score of 90.1%.
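Since judge quality is reported as an F1-score over binary accept/reject verdicts, the sketch below illustrates how such a score might be computed. This is a minimal illustration, not the paper's evaluation code: the `gold` and `judge` arrays are hypothetical, and the actual 𝜇-MATH protocol and aggregation (e.g., any macro-averaging across judgment classes) may differ.

```python
# Minimal sketch: scoring an LLM judge's verdicts against gold labels.
# Assumes 1 = "solution accepted as correct", 0 = "solution rejected".
# The sample data is illustrative only, not taken from the paper.
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0, 1, 0, 0]   # human meta-evaluation labels
judge = [1, 0, 1, 0, 0, 1, 1, 0]  # LLM judge verdicts on the same solutions

print(f"Judge F1: {f1_score(gold, judge):.3f}")
```

A single headline F1 like the 90.1% cited above would come from evaluating the judge's verdicts across the full set of meta-evaluation tasks in this manner.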