Botao Li
2026
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
Chengwu Liu | Yichun Yin | Ye Yuan | Jiaxuan Xie | Botao Li | Siqi Li | Jianhao Shen | Yan Xu | Lifeng Shang | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chengwu Liu | Yichun Yin | Ye Yuan | Jiaxuan Xie | Botao Li | Siqi Li | Jianhao Shen | Yan Xu | Lifeng Shang | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Most ATP benchmarks embed the final answer within the formal statement — a convention we call "Easy Mode" — a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability.We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof.To enable Hard Mode research, we make two contributions.First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks.Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers.DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode — while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.