SpecNFS: A Challenge Dataset Towards Extracting Formal Models from Natural Language Specifications

Sayontan Ghosh, Amanpreet Singh, Alex Merenstein, Wei Su, Scott A. Smolka, Erez Zadok, Niranjan Balasubramanian


Abstract
Can NLP assist in building formal models for verifying complex systems? We study this challenge in the context of parsing Network File System (NFS) specifications. We define a semantic-dependency problem over SpecIR, a representation language we introduce to model sentences appearing in NFS specification documents (RFCs) as IF-THEN statements, and present an annotated dataset of 1,198 sentences. We develop and evaluate semantic-dependency parsing systems for this problem. Evaluations show that even when using a state-of-the-art language model, there is significant room for improvement, with the best models achieving F1 scores of only 60.5 and 33.3 in the named-entity-recognition and dependency-link-prediction sub-tasks, respectively. We also release additional unlabeled data and other domain-related texts. Experiments show that these additional resources increase the F1 measure when used for simple domain-adaptation and transfer-learning-based approaches, suggesting fruitful directions for further research.
Anthology ID:
2022.lrec-1.233
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2166–2176
URL:
https://aclanthology.org/2022.lrec-1.233
Cite (ACL):
Sayontan Ghosh, Amanpreet Singh, Alex Merenstein, Wei Su, Scott A. Smolka, Erez Zadok, and Niranjan Balasubramanian. 2022. SpecNFS: A Challenge Dataset Towards Extracting Formal Models from Natural Language Specifications. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2166–2176, Marseille, France. European Language Resources Association.
Cite (Informal):
SpecNFS: A Challenge Dataset Towards Extracting Formal Models from Natural Language Specifications (Ghosh et al., LREC 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.lrec-1.233.pdf
Code:
stonybrooknlp/specnfs
Data:
BioPenn Treebank