Assessing the Sensitivity and Alignment of FOL Closeness Metrics

Ramya Keerthy Thatikonda; Wray Buntine; Ehsan Shareghi

doi:10.18653/v1/2025.findings-emnlp.910

Abstract

The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) relies on translating natural language (NL) statements into First-Order Logic (FOL) and passing them to external theorem provers. However, the correctness of FOL statements, which comprise operators and text, often goes unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we conduct a comprehensive study of the sensitivity of existing metrics (NL, FOL, and graph-based) and of their alignment with LLM-as-a-judge on FOL evaluation to measure robustness. We introduce operator and text-based perturbations to ground-truth FOL statements to assess metric sensitivity. We then evaluate the robustness of the metrics by comparing them against LLM judgement. Our empirical findings highlight a clear oversensitivity of the n-gram metric BLEU to text perturbations. Operator perturbations affect the semantic graph metric Smatch++ for structural changes, and the FOL metric for specific operator changes. We observe a closer alignment between BertScore and LLM judgement, underscoring the importance of semantic evaluation. Additionally, we show that combining metrics enhances both robustness and sensitivity compared to using individual metrics.

Anthology ID: 2025.findings-emnlp.910
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 16775–16785
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.910/
DOI: 10.18653/v1/2025.findings-emnlp.910
Cite (ACL): Ramya Keerthy Thatikonda, Wray Buntine, and Ehsan Shareghi. 2025. Assessing the Sensitivity and Alignment of FOL Closeness Metrics. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 16775–16785, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Assessing the Sensitivity and Alignment of FOL Closeness Metrics (Thatikonda et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.910.pdf
Checklist: 2025.findings-emnlp.910.checklist.pdf
