Abstract
Theorem proving presents a significant challenge for large language models (LLMs) because formal proofs must be rigorously checked by proof assistants such as Lean, leaving no margin for error or hallucination. While existing LLM-based theorem provers attempt to operate autonomously, they often struggle with novel and complex theorems where human insight is essential. Lean Copilot is a framework that integrates LLM inference into the Lean proof assistant environment. In this work, we benchmark the performance of several LLMs, including general-purpose and math-specific models, for theorem proving using the Lean Copilot framework. Our initial investigation suggests that a general-purpose large model such as LLaMa-70B still has an edge over smaller math-specific models for this task. We provide insights into the performance of the different LLMs we evaluated.
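As a minimal, illustrative sketch of the workflow the abstract describes (the theorem below is a toy example, not one drawn from the paper's benchmark), Lean Copilot is typically used by importing it in a Lean 4 file and invoking one of its tactics, such as `suggest_tactics`, at the point in a proof where model-generated suggestions are wanted:

```lean
import LeanCopilot

-- Toy example: ask the LLM backend for tactic suggestions
-- at the current proof state; suggestions appear in the infoview.
theorem add_abc (a b c : Nat) : a + b + c = a + c + b := by
  suggest_tactics
```

Lean Copilot also exposes tactics for full proof search (`search_proof`) and premise selection (`select_premises`), which the benchmarking setup described in the paper can build on.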
- Anthology ID:
- 2024.nlp4science-1.18
- Volume:
- Proceedings of the 1st Workshop on NLP for Science (NLP4Science)
- Month:
- November
- Year:
- 2024
- Address:
- Miami, FL, USA
- Editors:
- Lotem Peled-Cohen, Nitay Calderon, Shir Lissak, Roi Reichart
- Venue:
- NLP4Science
- Publisher:
- Association for Computational Linguistics
- Pages:
- 208–218
- URL:
- https://aclanthology.org/2024.nlp4science-1.18
- DOI:
- 10.18653/v1/2024.nlp4science-1.18
- Cite (ACL):
- Vanessa Lama, Catherine Ma, and Tirthankar Ghosal. 2024. Benchmarking Automated Theorem Proving with Large Language Models. In Proceedings of the 1st Workshop on NLP for Science (NLP4Science), pages 208–218, Miami, FL, USA. Association for Computational Linguistics.
- Cite (Informal):
- Benchmarking Automated Theorem Proving with Large Language Models (Lama et al., NLP4Science 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.nlp4science-1.18.pdf