P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan


Abstract
Existing methods for understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for properly assessing models' capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by humans. P-FOLIO is collected with an annotation protocol that helps humans annotate well-structured natural language proofs for first-order logic reasoning problems in a step-by-step manner. The number of reasoning steps in P-FOLIO ranges from 0 to 20. We further use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities. We evaluate LLM reasoning capabilities at a fine granularity via single-step inference rule classification, covering more diverse inference rules at higher levels of complexity than previous work. Given that a single model-generated reasoning chain could take a completely different path than the human-annotated one, we sample multiple reasoning chains from a model and use pass@k metrics to evaluate the quality of model-generated reasoning chains. We show that human-written reasoning chains significantly boost the logical reasoning capabilities of LLMs via many-shot prompting and fine-tuning. Furthermore, fine-tuning Llama3-7B on P-FOLIO improves model performance by 10% or more on three other out-of-domain logical reasoning datasets.
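The pass@k evaluation described above can be computed with the standard unbiased estimator over n sampled reasoning chains per problem, of which c are judged correct. The sketch below is illustrative and assumes this setup; the function name and the example counts are not taken from the paper's code.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k chains
    drawn without replacement from n samples is correct, given that c of
    the n sampled chains are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 reasoning chains sampled per problem, 3 judged correct.
print(pass_at_k(n=10, c=3, k=1))  # 0.30
print(pass_at_k(n=10, c=3, k=5))  # about 0.92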
Anthology ID:
2024.findings-emnlp.966
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16553–16565
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.966/
DOI:
10.18653/v1/2024.findings-emnlp.966
Cite (ACL):
Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, and Arman Cohan. 2024. P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16553–16565, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains (Han et al., Findings 2024)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.966.pdf
Data:
2024.findings-emnlp.966.data.zip