A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

Hayate Funakura; Hyunsoo Kim; Koji Mineshima

A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

Hayate Funakura, Hyunsoo Kim, Koji Mineshima

Abstract

Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All data and code will be publicly released.

Anthology ID:: 2025.blackboxnlp-1.18
Volume:: Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:: November
Year:: 2025
Address:: Suzhou, China
Editors:: Yonatan Belinkov, Aaron Mueller, Najoung Kim, Hosein Mohebbi, Hanjie Chen, Dana Arad, Gabriele Sarti
Venues:: BlackboxNLP | WS
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 295–306
Language:
URL:: https://preview.aclanthology.org/ingest-emnlp/2025.blackboxnlp-1.18/
DOI:
Bibkey:
Cite (ACL):: Hayate Funakura, Hyunsoo Kim, and Koji Mineshima. 2025. A Theorem-Proving-Based Evaluation of Neural Semantic Parsing. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 295–306, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):: A Theorem-Proving-Based Evaluation of Neural Semantic Parsing (Funakura et al., BlackboxNLP 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-emnlp/2025.blackboxnlp-1.18.pdf

PDF Cite Search Fix data