Yeliang Xiu


2025

LTRAG: Enhancing Autoformalization and Self-refinement for Logical Reasoning with Thought-Guided RAG
Ruikang Hu | Shaoyu Lin | Yeliang Xiu | Yongmei Liu
Findings of the Association for Computational Linguistics: ACL 2025

Logical reasoning is fundamental to intelligent systems. Large language models (LLMs) have demonstrated promise in natural language (NL) reasoning, especially with techniques like chain-of-thought (CoT) prompting. Neuro-symbolic methods like Logic-LM and LINC further enhance performance on the challenging FOLIO and AR-LSAT datasets by integrating LLM-based formalization with symbolic solvers, and possibly LLM-based refinement. However, these methods still struggle with the accurate formalization of complex NL problems. In this paper, we introduce LTRAG, a framework that enhances autoformalization and self-refinement for logical reasoning with Retrieval-Augmented Generation (RAG) by building knowledge bases of thought-guided examples (https://github.com/sysulic/LTRAG). Experimental results on FOLIO and AR-LSAT show that LTRAG consistently outperforms Logic-LM and LINC across different models. With GPT-4 on AR-LSAT, it achieves an accuracy gain of 13% over Logic-LM.
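The core idea lends itself to a short sketch: retrieve thought-guided (problem, thought, formalization) examples that resemble the input problem, then prompt the LLM with them before asking it to formalize the new problem. This is a minimal illustration only; the knowledge-base entries, the token-overlap retriever, and the call_llm parameter below are assumptions for exposition, not LTRAG's actual components.

```python
# Illustrative sketch of thought-guided RAG for autoformalization.
# KNOWLEDGE_BASE, the overlap retriever, and call_llm are assumptions
# for exposition, not LTRAG's actual components.

KNOWLEDGE_BASE = [
    {
        "problem": "All birds can fly. Tweety is a bird. Can Tweety fly?",
        "thought": "A universal rule over birds plus a membership fact; "
                   "instantiate the rule at tweety.",
        "formalization": "forall x (Bird(x) -> Fly(x)). Bird(tweety). ? Fly(tweety)",
    },
    # ... more thought-guided examples ...
]

def score(query: str, example: dict) -> float:
    """Crude token-overlap similarity; a real retriever would use embeddings."""
    q, e = set(query.lower().split()), set(example["problem"].lower().split())
    return len(q & e) / max(len(q | e), 1)

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Return the k knowledge-base examples most similar to the query."""
    return sorted(KNOWLEDGE_BASE, key=lambda ex: score(query, ex), reverse=True)[:k]

def build_prompt(problem: str) -> str:
    """Few-shot prompt: retrieved (problem, thought, formalization) triples,
    then the new problem, eliciting a thought before the formalization."""
    shots = "\n\n".join(
        f"Problem: {ex['problem']}\nThought: {ex['thought']}\n"
        f"Formalization: {ex['formalization']}"
        for ex in retrieve(problem)
    )
    return f"{shots}\n\nProblem: {problem}\nThought:"

def autoformalize(problem: str, call_llm) -> str:
    """call_llm stands in for any text-completion API. Its output would be
    handed to a symbolic solver, with failures triggering self-refinement."""
    return call_llm(build_prompt(problem))
```

In a full pipeline, the returned formalization would be passed to a symbolic solver, and a parse or solver error would be fed back to the LLM for a refinement round.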

2022

LogicNMR: Probing the Non-monotonic Reasoning Ability of Pre-trained Language Models
Yeliang Xiu | Zhanhao Xiao | Yongmei Liu
Findings of the Association for Computational Linguistics: EMNLP 2022

The logical reasoning capabilities of pre-trained language models have recently received much attention. As one of the vital reasoning paradigms, non-monotonic reasoning is reasoning in which conclusions may be invalidated by new information. Existing work constructed the non-monotonic inference dataset δ-NLI and explored the performance of language models on it. However, the δ-NLI dataset is entangled with commonsense reasoning. In this paper, we explore the pure non-monotonic reasoning ability of pre-trained language models. We build a non-monotonic reasoning benchmark, named LogicNMR, with explicit default rules and iterative updates. We evaluate popular language models on LogicNMR from the perspectives of accuracy, generalization, proof-based traceability, and robustness. The experimental results show that even though fine-tuned language models achieve an accuracy of more than 94.4% on LogicNMR, their performance drops significantly on generalization and proof-based traceability.
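To make the paradigm concrete: a default rule licenses a conclusion unless contradicted, and an update can retract it. The sketch below uses the textbook Tweety example; the string encoding is an assumption for exposition, not the LogicNMR format.

```python
# Toy default reasoning with an iterative update: a conclusion drawn by
# default is retracted when new information contradicts it. The Tweety
# rules are a textbook illustration, not the LogicNMR dataset format.

def conclusions(facts: set[str]) -> set[str]:
    derived = set(facts)
    # Strict rules: penguins are birds, and penguins do not fly.
    if "penguin(tweety)" in derived:
        derived |= {"bird(tweety)", "-fly(tweety)"}
    # Default rule: birds fly, unless the contrary is known.
    if "bird(tweety)" in derived and "-fly(tweety)" not in derived:
        derived.add("fly(tweety)")
    return derived

facts = {"bird(tweety)"}
print("fly(tweety)" in conclusions(facts))   # True: the default applies

facts.add("penguin(tweety)")                 # iterative update: new information
print("fly(tweety)" in conclusions(facts))   # False: the conclusion is retracted
```

A monotonic logic could never lose a conclusion as the fact set grows; the retraction after the update is exactly the behavior LogicNMR probes.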