Yifan Xiao


2025

pdf bib
HighMATH: Evaluating Math Reasoning of Large Language Models in Breadth and Depth
Yan Liu | Minghui Zhang | Bojian Xiong | Yifan Xiao | Yinong Sun | Yating Mei | Longyu Zeng | Jingchao Yang | Yang Wang | Deyi Xiong
Findings of the Association for Computational Linguistics: EMNLP 2025

With the rapid development of large language models (LLMs) in math reasoning, the accuracy of models on existing math benchmarks has gradually approached 90% or even higher. More challenging math benchmarks are hence urgently in need to satisfy the increasing evaluation demands. To bridge this gap, we propose HighMATH. Problems in HighMATH are collected according to 3 criteria: problem complexity, knowledge domain diversity and fine-grained annotations. We collect 5,293 problems from Chinese senior high school mathematics exams published in 2024, covering 8 subjects and 7 levels of difficulty, with each problem involving an average of more than 2.4 knowledge points. We conduct a thorough evaluation of latest LLMs on the curated HighMATH, including o1-like models. Evaluation results demonstrate that the accuracy of advanced LLMs on HighMATH is significantly lower than that on previous math reasoning benchmarks. This gap even exceeds 30%. Our results also suggest that properly trained smaller LLMs may have great potential in math reasoning. Our data is available at https://github.com/tjunlp-lab/HighMATH.