Bikash Gyawali

2020

Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
Bikash Gyawali | Lucas Anastasiou | Petr Knoth
Proceedings of the Twelfth Language Resources and Evaluation Conference

Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection is a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service serving deduplication of scholarly documents in real time.
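
To make the hybrid idea in this abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes word-level shingles, MinHash signatures with banded locality sensitive hashing to propose candidate pairs, and cosine similarity over averaged word vectors as the meaning check. All function names, the parameters `k`, `num_perm` and `bands`, and the `embeddings` lookup table are hypothetical and chosen only for illustration.

```python
import hashlib
import math
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenisation: whitespace)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_perm=32):
    """MinHash signature: for each seed, the smallest hash value over all shingles."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def lsh_candidates(signatures, bands=8):
    """Return pairs of document ids whose signatures agree on at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, c) for i, a in enumerate(ordered) for c in ordered[i + 1:])
    return pairs


def mean_vector(text, embeddings, dim):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In such a sketch, a pair would be flagged as a (near) duplicate only if it appears in `lsh_candidates(...)` and the cosine similarity of the two mean vectors exceeds a threshold. How the two signals are actually combined, and the threshold values, are among the parameters the paper reports tuning empirically and are not reproduced here.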

Proceedings of the 8th International Workshop on Mining Scientific Publications
Petr Knoth | Christopher Stahl | Bikash Gyawali | David Pride | Suchetha N. Kunnath | Drahomira Herrmannova
Proceedings of the 8th International Workshop on Mining Scientific Publications

Overview of the 2020 WOSP 3C Citation Context Classification Task
Suchetha Nambanoor Kunnath | David Pride | Bikash Gyawali | Petr Knoth
Proceedings of the 8th International Workshop on Mining Scientific Publications

The 3C Citation Context Classification task is the first shared task addressing citation context classification. The two subtasks, A and B, associated with this shared task involve the classification of citations based on their purpose and influence, respectively. Both tasks use a portion of the new ACT dataset, developed by researchers at The Open University, UK. The tasks were hosted on Kaggle, and the participating systems were evaluated using the macro F-score. Three teams participated in subtask A and four teams participated in subtask B. The best performing systems obtained an overall score of 0.2056 for subtask A and 0.5556 for subtask B, outperforming the simple majority class baseline models, which scored 0.11489 and 0.32249, respectively. In this paper we provide a report specifying the shared task, the dataset used, a short description of the participating systems and the final results obtained by the teams based on the evaluation criteria. The shared task was organised as part of the 8th International Workshop on Mining Scientific Publications (WOSP 2020).
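
The evaluation described in this abstract, a macro F-score compared against a majority class baseline, can be illustrated with a short sketch. This is not the task's official scoring script. The function names are hypothetical, the macro average is assumed to run over every label seen in either the gold or the predicted labels, and the labels in the usage note are placeholders rather than the ACT dataset's classes.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


def majority_baseline(y_train, n):
    """Predict the most frequent training label for each of n test items."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n
```

For example, with placeholder gold labels `["a", "a", "a", "b"]`, the baseline predicts `"a"` four times. Class `"a"` then scores an F1 of 6/7 and class `"b"` scores 0, so the macro average is 3/7. This shows why a majority class baseline scores low under macro averaging when classes are imbalanced.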
