LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Mihir Parmar; Nisarg Patel; Neeraj Varshney; Mutsumi Nakamura; Man Luo; Santosh Mashetty; Arindam Mitra; Chitta Baral

doi:10.18653/v1/2024.acl-long.739

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral

Abstract

Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really “reason” over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to ‘logical reasoning’ has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes tend to prioritize parametric knowledge over contextual information and overlook the correct reasoning chain. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs.

Anthology ID:: 2024.acl-long.739
Volume:: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: August
Year:: 2024
Address:: Bangkok, Thailand
Editors:: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 13679–13707
Language:
URL:: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.acl-long.739/
DOI:: 10.18653/v1/2024.acl-long.739
Bibkey:
Cite (ACL):: Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024. LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13679–13707, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):: LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models (Parmar et al., ACL 2024)
Copy Citation:
PDF:: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.acl-long.739.pdf

PDF Search Fix metadata