Chuxue Cao
2025
Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving
Chuxue Cao | Mengze Li | Juntao Dai | Jinluan Yang | Zijian Zhao | Shengyu Zhang | Weijie Shi | Chengzhong Liu | Sirui Han | Yike Guo
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions remains under-explored. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B’s low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs’ generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback mechanism to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs’ mathematical reasoning through FOL theorem proving, introducing a novel inference-stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
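To give a sense of the "Lean 4 format" mentioned above, here is a minimal sketch of how a multi-step FOL deduction can be stated and proved in Lean 4. The theorem name and statement are illustrative assumptions, not an actual entry from the paper's dataset:

```lean
-- Hypothetical sketch of a first-order-logic theorem in Lean 4 style.
-- Chaining two implications requires multiple deduction steps,
-- the kind of multi-step FOL reasoning the benchmark targets.
theorem modus_chain {P Q R : Prop} (hpq : P → Q) (hqr : Q → R) (hp : P) : R := by
  apply hqr   -- reduce the goal R to proving Q
  apply hpq   -- reduce the goal Q to proving P
  exact hp    -- close the goal with the hypothesis P
```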
SafeLawBench: Towards Safe Alignment of Large Language Models
Chuxue Cao | Han Zhu | Jiaming Ji | Qichao Sun | Zhenghao Zhu | Wu Yinyu | Josef Dai | Yaodong Yang | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025
With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs’ safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs’ safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
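The abstract notes that majority voting over repeated model samples can improve performance on the multi-choice tasks. The sketch below shows one simple way such voting can work; the function name, sampling setup, and tie-breaking rule are illustrative assumptions, not the benchmark's actual implementation:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent answer among repeated samples.

    `answers` holds the option letters (e.g. "A".."D") produced by
    sampling the same model several times on one multi-choice question.
    Ties are broken in favor of the answer that appeared first.
    """
    counts = Counter(answers)
    return max(counts, key=lambda a: (counts[a], -answers.index(a)))

# Example: five samples from one model on a single question.
print(majority_vote(["B", "A", "B", "C", "B"]))  # -> "B"
```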