Reliable Evaluation and Benchmarks for Statement Autoformalization
Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut
Abstract
Evaluating statement autoformalization, the task of translating natural-language mathematics into formal languages such as Lean 4, remains a significant challenge: few metrics, datasets, or standards exist to robustly measure progress. In this work, we present a comprehensive approach that combines improved metrics, robust benchmarks, and systematic evaluation to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset of 3,752 annotated examples for assessing the quality of evaluation metrics. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, containing 619 new pairs of research-level mathematics drawn from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
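To make the task concrete, below is a minimal, hypothetical sketch of statement autoformalization in Lean 4 with Mathlib. The exercise and theorem name are our own illustration, not drawn from ProofNet#, RLM25, or ProofNetVerif; following common practice for statement-level benchmarks, the proof is left as `sorry` since only the statement is being formalized.

```lean
import Mathlib

/- Natural-language statement (hypothetical example):
   "If G is a group in which (a * b)^2 = a^2 * b^2 for all a and b,
    then G is abelian." -/

-- A candidate formalization produced by an autoformalization system.
-- Only the statement is evaluated, so the proof is elided with `sorry`.
theorem example_abelian_of_sq_mul {G : Type*} [Group G]
    (h : ∀ a b : G, (a * b) ^ 2 = a ^ 2 * b ^ 2) :
    ∀ a b : G, a * b = b * a := by
  sorry
```

A metric such as BEq+ would then judge whether a candidate statement like this one is equivalent to a reference formalization, rather than requiring an exact textual match.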
- Anthology ID:
- 2025.emnlp-main.907
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17958–17980
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.907/
- Cite (ACL):
- Auguste Poiroux, Gail Weiss, Viktor Kunčak, and Antoine Bosselut. 2025. Reliable Evaluation and Benchmarks for Statement Autoformalization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17958–17980, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Reliable Evaluation and Benchmarks for Statement Autoformalization (Poiroux et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.907.pdf