Extracting Software Mentions and Relations using Transformers and LLM-Generated Synthetic Data at SOMD 2025

Pranshu Rastogi, Rajneesh Tiwari


Abstract
As part of the SOMD 2025 shared task on Software Mention Detection, we addressed the problem of detecting and disambiguating software mentions in academic texts, an important but underappreciated factor in research transparency and reproducibility. Software is an essential building block of scientific activity, but it often does not receive formal citation in scholarly literature, and the many informal mentions are hard to track and analyse. To improve research accessibility and interpretability, we built a system that identifies software mentions and their properties (e.g., version numbers, URLs) as named entities and classifies the relationships between them. Our dataset contained approximately 1,100 manually annotated sentences from full-text scholarly articles, covering diverse types of software such as operating systems and applications. We fine-tuned DeBERTa-based models for the Named Entity Recognition (NER) task and handled Relation Extraction (RE) as a classification problem over entity pairs. Because the dataset is small, we used Large Language Models to generate synthetic training data for augmentation. Our system achieved a 65% F1 score on NER (ranking 2nd in the test phase), a 47% F1 score on RE, and a combined macro F1 of 56%, demonstrating the effectiveness of our approach to this task.
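To make "RE as a classification problem over entity pairs" concrete, here is a minimal sketch of how candidate pairs might be built from NER output before being passed to a classifier. It is an illustration under stated assumptions, not the authors' implementation: the `Entity` class, the label names, the `[H]`/`[T]` marker tokens, and the helper names are all hypothetical, and the classifier itself is omitted. The last lines check that the reported combined score is consistent with an unweighted mean of the NER and RE F1 scores.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Entity:
    text: str
    label: str  # hypothetical label names, e.g. "Application", "Version"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span


def candidate_pairs(entities):
    """Return every ordered (head, tail) pair of distinct entities."""
    return list(permutations(entities, 2))


def mark_pair(sentence, head, tail):
    """Wrap the head and tail spans in marker tokens (assumes no overlap)."""
    spans = sorted(
        [(head.start, head.end, "[H]", "[/H]"),
         (tail.start, tail.end, "[T]", "[/T]")],
        reverse=True,  # insert from the right so earlier offsets stay valid
    )
    for start, end, open_tag, close_tag in spans:
        sentence = (sentence[:start] + open_tag + sentence[start:end]
                    + close_tag + sentence[end:])
    return sentence


sentence = "We used SPSS 25 for the analysis."
entities = [
    Entity("SPSS", "Application", 8, 12),
    Entity("25", "Version", 13, 15),
]

pairs = candidate_pairs(entities)
marked = [mark_pair(sentence, head, tail) for head, tail in pairs]

# Each marked sentence would be one input to a relation classifier
# (the classifier is not shown here).
for text in marked:
    print(text)

# Unweighted mean of the two reported F1 scores (NER 0.65, RE 0.47).
macro_f1 = (0.65 + 0.47) / 2
print(round(macro_f1, 2))
```

Enumerating ordered pairs lets a classifier distinguish the direction of a relation (for example, a version belonging to an application rather than the reverse), at the cost of a number of candidates that grows quadratically with the number of entities in a sentence.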
Anthology ID:
2025.sdp-1.17
Volume:
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editor:
Amanpreet Singh
Venues:
sdp | WS
Publisher:
Association for Computational Linguistics
Pages:
173–181
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.sdp-1.17/
Cite (ACL):
Pranshu Rastogi and Rajneesh Tiwari. 2025. Extracting Software Mentions and Relations using Transformers and LLM-Generated Synthetic Data at SOMD 2025. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 173–181, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Extracting Software Mentions and Relations using Transformers and LLM-Generated Synthetic Data at SOMD 2025 (Rastogi & Tiwari, sdp 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.sdp-1.17.pdf