Saurav Karmakar


2023

pdf
GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets
Wolfgang Otto | Matthäus Zloch | Lu Gan | Saurav Karmakar | Stefan Dietze
Findings of the Association for Computational Linguistics: EMNLP 2023

Named Entity Recognition (NER) models play a crucial role in various NLP tasks, including information extraction (IE) and text understanding. In academic writing, references to machine learning models and datasets are fundamental components of various computer science publications and necessitate accurate models for identification. Despite the advancements in NER, existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types, and consequently, baseline models cannot recognize them as such. In this paper, we release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets. In order to provide a nuanced understanding of how ML models and datasets are mentioned and utilized, our dataset also contains annotations for informal mentions like “our BERT-based model” or “an image CNN”. You can find the ground truth dataset and code to replicate model training at https://data.gesis.org/gsap/gsap-ner.

2020

pdf
CogALex-VI Shared Task: Bidirectional Transformer based Identification of Semantic Relations
Saurav Karmakar | John P. McCrae
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

This paper presents a bidirectional transformer based approach for recognising semantic relationships between a pair of words as proposed by CogALex VI shared task in 2020. The system presented here works by employing BERT embeddings of the words and passing the same over tuned neural network to produce a learning model for the pair of words and their relationships. Afterwards the very same model is used for the relationship between unknown words from the test set. CogALex VI provided Subtask 1 as the identification of relationship of three specific categories amongst English pair of words and the presented system opts to work on that. The resulted relationships of the unknown words are analysed here which shows a balanced performance in overall characteristics with some scope for improvement.