Yifan Xiao

2025

With the rapid development of large language models (LLMs) in mathematical reasoning, model accuracy on existing math benchmarks has gradually approached 90% or higher. More challenging math benchmarks are therefore urgently needed to meet growing evaluation demands. To bridge this gap, we propose HighMATH. Problems in HighMATH are collected according to three criteria: problem complexity, knowledge domain diversity, and fine-grained annotations. We collect 5,293 problems from Chinese senior high school mathematics exams published in 2024, covering 8 subjects and 7 levels of difficulty, with each problem involving more than 2.4 knowledge points on average. We conduct a thorough evaluation of the latest LLMs on the curated HighMATH, including o1-like models. Evaluation results demonstrate that the accuracy of advanced LLMs on HighMATH is significantly lower than on previous math reasoning benchmarks, with the gap even exceeding 30%. Our results also suggest that properly trained smaller LLMs may have great potential in math reasoning. Our data is available at https://github.com/tjunlp-lab/HighMATH.

Co-authors

Deyi Xiong 1

Jingchao Yang 1

Longyu Zeng 1

Minghui Zhang 1

Venues

findings 1
