An Annotated Dataset for Transformer-based Scholarly Information Extraction and Linguistic Linked Data Generation

Vayianos Pertsas; Marialena Kasapaki; Panos Constantopoulos

Abstract

We present a manually curated and annotated multidisciplinary dataset of 15,262 sentences from research articles (abstract and main text) that can be used for transformer-based extraction of three types of entities from scholarly publications: 1) research methods, which are named entities of variable length; 2) research goals, which appear as textual spans of variable length with a mostly fixed lexico-syntactic structure; and 3) research activities, which appear as textual spans of variable length with a complex lexico-syntactic structure. We explore the capabilities of our dataset by using it to train and fine-tune various ML and transformer-based models. We compare our fine-tuned models with LLM responses (ChatGPT 3.5) obtained through 10-shot learning, measuring F1 scores in token-based, entity-based strict, and entity-based partial evaluations across interdisciplinary and discipline-specific datasets in order to capture possible differences in discipline-oriented writing styles. Results show that fine-tuned transformer-based models significantly outperform few-shot learning with LLMs such as ChatGPT, highlighting the importance of annotated datasets for such tasks. Our dataset can also serve as a source of linguistic linked data in its own right. We demonstrate this by presenting indicative SPARQL queries executed over the resulting RDF knowledge graph.
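The three evaluation settings named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the span representation, the label names (`Method`, `Goal`), and in particular the definition of a partial match as "same label with any token overlap" are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed representation: (start token, end token exclusive, label).
Span = Tuple[int, int, str]


def f1(hits_pred: int, n_pred: int, hits_gold: int, n_gold: int) -> float:
    """Harmonic mean of precision (over predictions) and recall (over gold)."""
    precision = hits_pred / n_pred if n_pred else 0.0
    recall = hits_gold / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def overlaps(a: Span, b: Span) -> bool:
    """Assumed partial-match rule: same label and at least one shared token."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def strict_f1(gold: List[Span], pred: List[Span]) -> float:
    """Entity-based strict: boundaries and label must match exactly."""
    hits = len(set(gold) & set(pred))
    return f1(hits, len(pred), hits, len(gold))


def partial_f1(gold: List[Span], pred: List[Span]) -> float:
    """Entity-based partial: an overlapping span with the same label counts."""
    hits_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    hits_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    return f1(hits_pred, len(pred), hits_gold, len(gold))


def token_f1(gold: List[Span], pred: List[Span]) -> float:
    """Token-based: every labelled token is scored individually."""
    gold_tokens = {(i, label) for s, e, label in gold for i in range(s, e)}
    pred_tokens = {(i, label) for s, e, label in pred for i in range(s, e)}
    hits = len(gold_tokens & pred_tokens)
    return f1(hits, len(pred_tokens), hits, len(gold_tokens))


# Hypothetical sentence: the second prediction misses the first token of the goal span.
gold = [(0, 3, "Method"), (5, 9, "Goal")]
pred = [(0, 3, "Method"), (6, 9, "Goal")]

print(strict_f1(gold, pred), partial_f1(gold, pred), token_f1(gold, pred))
```

On this toy input the strict score is 0.5, the partial score is 1.0, and the token score is 12/13, which shows how a single boundary error is penalized differently under each setting.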

Anthology ID: 2024.ldl-1.11
Volume: Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Christian Chiarcos, Katerina Gkirtzou, Maxim Ionov, Fahad Khan, John P. McCrae, Elena Montiel Ponsoda, Patricia Martín Chozas
Venues: LDL | WS
Publisher: ELRA and ICCL
Pages: 84–93
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.ldl-1.11/
Cite (ACL): Vayianos Pertsas, Marialena Kasapaki, and Panos Constantopoulos. 2024. An Annotated Dataset for Transformer-based Scholarly Information Extraction and Linguistic Linked Data Generation. In Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, pages 84–93, Torino, Italia. ELRA and ICCL.
Cite (Informal): An Annotated Dataset for Transformer-based Scholarly Information Extraction and Linguistic Linked Data Generation (Pertsas et al., LDL 2024)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.ldl-1.11.pdf
