Natural Language Reasoning in Large Language Models: Analysis and Evaluation

Debela Gemechu, Ramon Ruiz-Dolz, Henrike Beyer, Chris Reed
Abstract
While Large Language Models (LLMs) have demonstrated promising results on a range of reasoning benchmarks—particularly in formal logic, mathematical tasks, and Chain-of-Thought prompting—less is known about their capabilities in unconstrained natural language reasoning. Argumentative reasoning, a form of reasoning naturally expressed in language and central to everyday discourse, presents unique challenges for LLMs due to its reliance on context, implicit assumptions, and value judgments. This paper addresses a gap in the study of reasoning in LLMs by presenting the first large-scale evaluation of their unconstrained natural language reasoning capabilities based on natural language argumentation. The paper offers three contributions: (i) the formalisation of a new strategy designed to evaluate argumentative reasoning in LLMs: argument-component selection; (ii) the creation of the Argument Reasoning Tasks (ART) dataset, a new benchmark for argument-component selection based on argument structures for natural language reasoning; and (iii) an extensive experimental analysis involving four different models, demonstrating the limitations of LLMs on natural language reasoning tasks.
Anthology ID:
2025.findings-acl.192
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3717–3741
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.192/
Cite (ACL):
Debela Gemechu, Ramon Ruiz-Dolz, Henrike Beyer, and Chris Reed. 2025. Natural Language Reasoning in Large Language Models: Analysis and Evaluation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 3717–3741, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Natural Language Reasoning in Large Language Models: Analysis and Evaluation (Gemechu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.192.pdf