2025
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
Chengwu Liu | Ye Yuan | Yichun Yin | Yan Xu | Xin Xu | Zaoyu Chen | Yasheng Wang | Lifeng Shang | Qun Liu | Ming Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT, which are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as black boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that “the gold standard for supporting a mathematical claim is to provide a proof”. We propose Safe, a retrospective, step-aware formal verification framework. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in the formal mathematical language Lean 4 at each reasoning step and to provide formal proofs that identify hallucinations. We evaluate Safe across multiple language models and various mathematical datasets, demonstrating significant performance improvements while offering interpretable and verifiable evidence. We also propose FormalStep, a benchmark for step-correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work is the first to use the formal mathematical language Lean 4 to verify content generated by LLMs, aligning with the original motivation for formal mathematical languages: to provide a robust foundation for hallucination-prone human-written proofs.
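As a minimal illustration of the kind of artifact this pipeline is described as producing — the sketch below is ours, with invented theorem names, not an excerpt from the paper — a CoT step asserting a concrete arithmetic fact can be restated as a Lean 4 theorem; a machine-checked proof certifies the step, while failure to prove it flags a likely hallucination:

    -- Hypothetical step-level claim: a CoT step asserting "27 * 3 = 81"
    -- becomes a Lean 4 theorem whose proof certifies the step.
    theorem step_claim : 27 * 3 = 81 := by
      rfl

    -- A hypothetical inequality-style step claim, discharged by
    -- decidability of comparisons on natural numbers.
    theorem step_bound : (5 : Nat) ≤ 8 := by
      decide

Both theorems check in core Lean 4; real step claims would of course be richer and may need Mathlib, but the workflow is the same: state the step, attempt a proof, and treat an unprovable statement as evidence against the step.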
CCNU at SemEval-2025 Task 8: Enhancing Question Answering on Tabular Data with Two-Stage Corrections
Chenlian Zhou | Xilu Cai | Yajuan Tong | Chengzhao Wu | Xin Xu | Guanyi Chen | Tingting He
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
We present the system developed by the Central China Normal University (CCNU) team for SemEval-2025 shared task 8, which focuses on question answering (QA) over tabular data. Our approach leverages multiple Large Language Models (LLMs) and casts tabular QA as code completion. To improve reliability, we introduce a two-stage correction mechanism: the LLM is instructed to revise its code based on two judgments, namely whether the code is executable and whether the answer obtained by executing it is semantically consistent with the question. Our experiments demonstrate that code correction helps while answer correction does not. Finally, we discuss other, unsuccessful approaches explored during development.
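The two-stage mechanism can be pictured as a small control loop. The sketch below is our reconstruction under stated assumptions — llm is a generic prompt-to-text callable, and run_code and is_consistent are illustrative stand-ins, not the team's actual components:

    import traceback

    def run_code(code: str, table) -> object:
        """Execute generated code with the table in scope and return the
        variable `answer` (assumption: a real system would sandbox this)."""
        namespace = {"table": table}
        exec(code, namespace)
        return namespace.get("answer")

    def is_consistent(question: str, answer, llm) -> bool:
        """Stage-2 judge: ask the LLM whether the answer semantically fits
        the question (an assumed judge; the team's may differ)."""
        verdict = llm(f"Question: {question}\nAnswer: {answer!r}\n"
                      "Does the answer address the question? Reply yes or no.")
        return verdict.strip().lower().startswith("yes")

    def tabular_qa(question: str, table, llm, max_rounds: int = 3):
        """Tabular QA as code completion with two-stage corrections."""
        code = llm(f"Complete Python code that sets `answer` for: {question}")
        for _ in range(max_rounds):
            # Stage 1: execution correction -- repair code that crashes.
            try:
                answer = run_code(code, table)
            except Exception:
                code = llm("Fix this code so it runs.\n"
                           f"Error:\n{traceback.format_exc()}\nCode:\n{code}")
                continue
            # Stage 2: answer correction -- revise code whose output does
            # not semantically match the question.
            if is_consistent(question, answer, llm):
                return answer
            code = llm(f"The answer {answer!r} does not fit the question "
                       f"{question!r}. Revise the code:\n{code}")
        return None

Dropping the stage-2 branch corresponds to the abstract's finding that execution-based code correction helps while answer correction does not.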
2023
LAMBADA: Backward Chaining for Automated Reasoning in Natural Language
Mehran Kazemi | Najoung Kim | Deepti Bhatia | Xin Xu | Deepak Ramachandran
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Remarkable progress has been made on automated reasoning with natural text, by using Large Language Models (LLMs) and methods such as Chain-of-Thought prompting and Selection-Inference. These techniques search for proofs in the forward direction from axioms to the conclusion, which suffers from a combinatorial explosion of the search space, and thus high failure rates for problems requiring longer chains of reasoning. The classical automated reasoning literature has shown that reasoning in the backward direction (i.e. from intended conclusion to supporting axioms) is significantly more efficient at proof-finding. Importing this intuition into the LM setting, we develop a Backward Chaining algorithm, called LAMBADA, that decomposes reasoning into four sub-modules, that are simply implemented by few-shot prompted LLM inference. We show that LAMBADA achieves sizable accuracy boosts over state-of-the-art forward reasoning methods on two challenging logical reasoning datasets, particularly when deep and accurate proof chains are required.