Can Large Language Models Win the International Mathematical Games?

Alessio Cocchieri; Luca Ragazzi; Giuseppe Tagliavini; Lorenzo Tordi; Antonella Carbonaro; Gianluca Moro

Abstract

Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization and clearly defined skill requirements and, crucially, were not designed to assess human performance in international competitions. To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championship. Organized by seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs’ mathematical and logical reasoning abilities. Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants (even 11-year-olds consistently outperform some of the strongest models), highlighting the need for advancements. Further, our detailed error analysis offers valuable insights to guide future research. The data is publicly available at https://disi-unibo-nlp.github.io/math-games.

Anthology ID: 2025.emnlp-main.488
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9656–9682
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.488/
Cite (ACL): Alessio Cocchieri, Luca Ragazzi, Giuseppe Tagliavini, Lorenzo Tordi, Antonella Carbonaro, and Gianluca Moro. 2025. Can Large Language Models Win the International Mathematical Games?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9656–9682, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Can Large Language Models Win the International Mathematical Games? (Cocchieri et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.488.pdf
Checklist: 2025.emnlp-main.488.checklist.pdf
