Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

Thanh Le-Cong, Bach Le, Toby Murray


Abstract
Large Language Models (LLMs) are increasingly being used to automate programming tasks. However, the capabilities of LLMs in reasoning about program semantics are still inadequately studied, leaving substantial potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate the reasoning abilities of Large Language Models (LLMs) on program semantics. Specifically, it utilizes the task of synthesizing formal program specifications as a proxy measure for assessing the semantic reasoning of LLMs. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs to synthesize consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%. FormalBench is packaged as an executable library and has been released at https://github.com/thanhlecongg/FormalBench/.
Anthology ID:
2025.acl-long.1068
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
21991–22014
Language:
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1068/
DOI:
Bibkey:
Cite (ACL):
Thanh Le-Cong, Bach Le, and Toby Murray. 2025. Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21991–22014, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference (Le-Cong et al., ACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1068.pdf