Carolina Dias-Alexiou


2025

An in-depth human study of the mathematical reasoning abilities in Large Language Models
Carolina Dias-Alexiou | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)

We study the generalization capabilities of large language models (LLMs) through the lens of mathematical reasoning, asking whether these models can recognize that two structures are the same even when they do not share the same nomenclature. We propose a human study to evaluate whether LLMs can reproduce proofs that they have most likely seen during training when the symbols do not match the ones seen there. To test this in a controlled scenario, we look at proofs in propositional calculus, which is foundational for other logic systems, semantically complete, and widely discussed online. We replace the implication operator (→) with an unrelated, arbitrary symbol and ask experts to evaluate how the output of a selection of LLMs changes in terms of compliance, correctness, extensiveness, and coherence. Our results show that nearly all of the models we test produce lower-quality proofs under this substitution, open-weights models in particular, suggesting that these LLMs' ability to reason in this context has important limitations.
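
A minimal sketch of the operator-substitution probe the abstract describes, assuming the perturbation is a plain textual replacement of the implication arrow in each proof prompt; the replacement glyph "⊛" is hypothetical, since the abstract does not state which arbitrary symbol the authors used:

def rewrite_proof_prompt(prompt: str, symbol: str = "⊛") -> str:
    """Replace every implication arrow in a propositional-calculus
    prompt with an unrelated symbol, leaving all other tokens intact.
    The symbol "⊛" is a hypothetical stand-in, not the paper's choice."""
    return prompt.replace("→", symbol).replace("->", symbol)

original = "Prove: (p → q) → ((q → r) → (p → r))"
print(rewrite_proof_prompt(original))
# Prove: (p ⊛ q) ⊛ ((q ⊛ r) ⊛ (p ⊛ r))

The perturbed prompt can then be sent to each model under study, and expert raters compare its output against the output for the unmodified prompt along the four axes named above (compliance, correctness, extensiveness, coherence).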