Proceedings of the First Workshop on Scholarly Document Processing

Muthu Kumar Chandrasekaran, Anita de Waard, Guy Feigenblat, Dayne Freitag, Tirthankar Ghosal, Eduard Hovy, Petr Knoth, David Konopnicki, Philipp Mayr, Robert M. Patton, Michal Shmueli-Scheuer (Editors)

Anthology ID:
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the First Workshop on Scholarly Document Processing
Muthu Kumar Chandrasekaran | Anita de Waard | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Eduard Hovy | Petr Knoth | David Konopnicki | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer

pdf bib
Overview of the First Workshop on Scholarly Document Processing (SDP)
Muthu Kumar Chandrasekaran | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Eduard Hovy | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard

Next to keeping up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. To address these challenges, computational work on enhancing search, summarization, and analysis of scholarly documents has flourished. However, the various strands of research on scholarly document processing remain fragmented. To reach to the broader NLP and AI/ML community, pool distributed efforts and enable shared access to published research, we held the 1st Workshop on Scholarly Document Processing at EMNLP 2020 as a virtual event. The SDP workshop consisted of a research track (including a poster session), two invited talks and three Shared Tasks (CL-SciSumm, Lay-Summ and LongSumm), geared towards easier access to scientific methods and results. Website:

pdf bib
The future of arXiv and knowledge discovery in open science
Steinn Sigurdsson

arXiv, the preprint server for the physical and mathematical sciences, is in its third decade of operation. As the flow of new, open access research increases inexorably, the challenges to keep up with and discover research content also become greater. I will discuss the status and future of arXiv, and possibilities and plans to make more effective use of the research database to enhance ongoing research efforts.

Acknowledgement Entity Recognition in CORD-19 Papers
Jian Wu | Pei Wang | Xin Wei | Sarah Rajtmajer | C. Lee Giles | Christopher Griffin

Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged and named entities by analyzing sentence structure. We develop an acknowledgement extraction system, AckExtract based on open-source text mining software and evaluate our method using manually labeled data. AckExtract uses the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F_1=0.92. We built a supplementary database by linking CORD-19 papers with acknowledgement entities extracted by AckExtract including persons and organizations and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All codes and labeled data are publicly available at

A Smart System to Generate and Validate Question Answer Pairs for COVID-19 Literature
Rohan Bhambhoria | Luna Feng | Dawn Sepehr | John Chen | Conner Cowling | Sedef Kocak | Elham Dolatabadi

Automatically generating question answer (QA) pairs from the rapidly growing coronavirus-related literature is of great value to the medical community. Creating high quality QA pairs would allow researchers to build models to address scientific queries for answers which are not readily available in support of the ongoing fight against the pandemic. QA pair generation is, however, a very tedious and time consuming task requiring domain expertise for annotation and evaluation. In this paper we present our contribution in addressing some of the challenges of building a QA system without gold data. We first present a method to create QA pairs from a large semi-structured dataset through the use of transformer and rule-based models. Next, we propose a means of engaging subject matter experts (SMEs) for annotating the QA pairs through the usage of a web application. Finally, we demonstrate some experiments showcasing the effectiveness of leveraging active learning in designing a high performing model with a substantially lower annotation effort from the domain experts.

Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset
Edwin Zhang | Nikhil Gupta | Raphael Tang | Xiao Han | Ronak Pradeep | Kuang Lu | Yue Zhang | Rodrigo Nogueira | Kyunghyun Cho | Hui Fang | Jimmy Lin

We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI. Our system has been online and serving users since late March 2020. The Covidex is the user application component of our three-pronged strategy to develop technologies for helping domain experts tackle the ongoing global pandemic. In addition, we provide robust and easy-to-use keyword search infrastructure that exploits mature fusion-based methods as well as standalone neural ranking models that can be incorporated into other applications. These techniques have been evaluated in the multi-round TREC-COVID challenge: Our infrastructure and baselines have been adopted by many participants, including some of the best systems. In round 3, we submitted the highest-scoring run that took advantage of previous training data and the second-highest fully automatic run. In rounds 4 and 5, we submitted the highest-scoring fully automatic runs.

The impact of preprint servers in the formation of novel ideas
Swarup Satish | Zonghai Yao | Andrew Drozdov | Boris Veytsman

We study whether novel ideas in biomedical literature appear first in preprints or traditional journals. We develop a Bayesian method to estimate the time of appearance for a phrase in the literature, and apply it to a number of phrases, both automatically extracted and suggested by experts. We see that presently most phrases appear first in the traditional journals, but there is a number of phrases with the first appearance on preprint servers. A comparison of the general composition of texts from bioRxiv and traditional journals shows a growing trend of bioRxiv being predictive of traditional journals. We discuss the application of the method for related problems.

Effective distributed representations for academic expert search
Mark Berger | Jakub Zavrel | Paul Groth

Expert search aims to find and rank experts based on a user’s query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a document-centric voting model for retrieval. In particular, we explore the impact of the use of contextualized embeddings on search performance. We also present results for paper embeddings that incorporate citation information through retrofitting. Additionally, experiments are conducted using different techniques for assigning author weights based on author order. We observe that using contextual embeddings produced by a transformer model trained for sentence similarity tasks produces the most effective paper representations for document-centric expert retrieval. However, retrofitting the paper embeddings and using elaborate author contribution weighting strategies did not improve retrieval performance.

Learning CNF Blocking for Large-scale Author Name Disambiguation
Kunho Kim | Athar Sefid | C. Lee Giles

Author name disambiguation (AND) algorithms identify a unique author entity record from all similar or same publication records in scholarly or similar databases. Typically, a clustering method is used that requires calculation of similarities between each possible record pair. However, the total number of pairs grows quadratically with the size of the author database making such clustering difficult for millions of records. One remedy is a blocking function that reduces the number of pairwise similarity calculations. Here, we introduce a new way of learning blocking schemes by using a conjunctive normal form (CNF) in contrast to the disjunctive normal form (DNF). We demonstrate on PubMed author records that CNF blocking reduces more pairs while preserving high pairs completeness compared to the previous methods that use a DNF and that the computation time is significantly reduced. In addition, we also show how to ensure that the method produces disjoint blocks so that much of the AND algorithm can be efficiently paralleled. Our CNF blocking method is tested on the entire PubMed database of 80 million author mentions and efficiently removes 82.17% of all author record pairs in 10 minutes.

Reconstructing Manual Information Extraction with DB-to-Document Backprojection: Experiments in the Life Science Domain
Mark-Christoph Müller | Sucheta Ghosh | Maja Rey | Ulrike Wittig | Wolfgang Müller | Michael Strube

We introduce a novel scientific document processing task for making previously inaccessible information in printed paper documents available to automatic processing. We describe our data set of scanned documents and data records from the biological database SABIO-RK, provide a definition of the task, and report findings from preliminary experiments. Rigorous evaluation proved challenging due to lack of gold-standard data and a difficult notion of correctness. Qualitative inspection of results, however, showed the feasibility and usefulness of the task

DeepPaperComposer: A Simple Solution for Training Data Preparation for Parsing Research Papers
Meng Ling | Jian Chen

We present DeepPaperComposer, a simple solution for preparing highly accurate (100%) training data without manual labeling to extract content from scholarly articles using convolutional neural networks (CNNs). We used our approach to generate data and trained CNNs to extract eight categories of both textual (titles, abstracts, headers, figure and table captions, and other texts) and non-textural content (figures and tables) from 30 years of IEEE VIS conference papers, of which a third were scanned bitmap PDFs. We curated this dataset and named it VISpaper-3K. We then showed our initial benchmark performance using VISpaper-3K over itself and CS-150 using YOLOv3 and Faster-RCNN. We open-source DeepPaperComposer of our training data generation and released the resulting annotation data VISpaper-3K to promote re-producible research.

Improved Local Citation Recommendation Based on Context Enhanced with Global Information
Zoran Medić | Jan Snajder

Local citation recommendation aims at finding articles relevant for given citation context. While most previous approaches represent context using solely text surrounding the citation, we propose enhancing context representation with global information. Specifically, we include citing article’s title and abstract into context representation. We evaluate our model on datasets with different citation context sizes and demonstrate improvements with globally-enhanced context representations when citation contexts are smaller.

On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining
Ibrahim Burak Ozyurt

Neural language representation models such as BERT have recently shown state of the art performance in downstream NLP tasks and bio-medical domain adaptation of BERT (Bio-BERT) has shown same behavior on biomedical text mining tasks. However, due to their large model size and resulting increased computational need, practical application of models such as BERT is challenging making smaller models with comparable performance desirable for real word applications. Recently, a new language transformers based language representation model named ELECTRA is introduced, that makes efficient usage of training data in a generative-discriminative neural model setting that shows performance gains over BERT. These gains are especially impressive for smaller models. Here, we introduce two small ELECTRA based model named Bio-ELECTRA and Bio-ELECTRA++ that are eight times smaller than BERT Base and Bio-BERT and achieves comparable or better performance on biomedical question answering, yes/no question answer classification, question answer candidate ranking and relation extraction tasks. Bio-ELECTRA is pre-trained from scratch on PubMed abstracts using a consumer grade GPU with only 8GB memory. Bio-ELECTRA++ is the further pre-trained version of Bio-ELECTRA trained on a corpus of open access full papers from PubMed Central. While, for biomedical named entity recognition, large BERT Base model outperforms Bio-ELECTRA++, Bio-ELECTRA and ELECTRA-Small++, with hyperparameter tuning Bio-ELECTRA++ achieves results comparable to BERT.

SciWING– A Software Toolkit for Scientific Document Processing
Abhinav Ramesh Kashyap | Min-Yen Kan

We introduce SciWING, an open-source soft-ware toolkit which provides access to state-of-the-art pre-trained models for scientific document processing (SDP) tasks, such as citation string parsing, logical structure recovery and citation intent classification. Compared to other toolkits, SciWING follows a full neural pipeline and provides a Python inter-face for SDP. When needed, SciWING provides fine-grained control for rapid experimentation with different models by swapping and stacking different modules. Transfer learning from general and scientific documents specific pre-trained transformers (i.e., BERT, SciBERT, etc.) can be performed. SciWING incorporates ready-to-use web and terminal-based applications and demonstrations to aid adoption and development. The toolkit is available from and the demos are available at

Multi-task Peer-Review Score Prediction
Jiyi Li | Ayaka Sato | Kazuya Shimura | Fumiyo Fukumoto

Automatic prediction on the peer-review aspect scores of academic papers can be a useful assistant tool for both reviewers and authors. To handle the small size of published datasets on the target aspect of scores, we propose a multi-task approach to leverage additional information from other aspects of scores for improving the performance of the target. Because one of the problems of building multi-task models is how to select the proper resources of auxiliary tasks and how to select the proper shared structures. We propose a multi-task shared structure encoding approach which automatically selects good shared network structures as well as good auxiliary resources. The experiments based on peer-review datasets show that our approach is effective and has better performance on the target scores than the single-task method and naive multi-task methods.

ERLKG: Entity Representation Learning and Knowledge Graph based association analysis of COVID-19 through mining of unstructured biomedical corpora
Sayantan Basu | Sinchani Chakraborty | Atif Hassan | Sana Siddique | Ashish Anand

We introduce a generic, human-out-of-the-loop pipeline, ERLKG, to perform rapid association analysis of any biomedical entity with other existing entities from a corpora of the same domain. Our pipeline consists of a Knowledge Graph (KG) created from the Open Source CORD-19 dataset by fully automating the procedure of information extraction using SciBERT. The best latent entity representations are then found by benchnmarking different KG embedding techniques on the task of link prediction using a Graph Convolution Network Auto Encoder (GCN-AE). We demonstrate the utility of ERLKG with respect to COVID-19 through multiple qualitative evaluations. Due to the lack of a gold standard, we propose a relatively large intrinsic evaluation dataset for COVID-19 and use it for validating the top two performing KG embedding techniques. We find TransD to be the best performing KG embedding technique with Pearson and Spearman correlation scores of 0.4348 and 0.4570 respectively. We demonstrate that a considerable number of ERLKG’s top protein, chemical and disease predictions are currently in consideration for COVID-19 related research.

Towards Grounding of Formulae
Takuto Asakura | André Greiner-Petter | Akiko Aizawa | Yusuke Miyao

A large amount of scientific knowledge is represented within mixed forms of natural language texts and mathematical formulae. Therefore, a collaboration of natural language processing and formula analyses, so-called mathematical language processing, is necessary to enable computers to understand and retrieve information from the documents. However, as we will show in this project, a mathematical notation can change its meaning even within the scope of a single paragraph. This flexibility makes it difficult to extract the exact meaning of a mathematical formula. In this project, we will propose a new task direction for grounding mathematical formulae. Particularly, we are addressing the widespread misconception of various research projects in mathematical information retrieval, which presume that mathematical notations have a fixed meaning within a single document. We manually annotated a long scientific paper to illustrate the task concept. Our high inter-annotator agreement shows that the task is well understood for humans. Our results indicate that it is worthwhile to grow the techniques for the proposed task to contribute to the further progress of mathematical language processing.

SChuBERT: Scholarly Document Chunks with BERT-encoding boost Citation Count Prediction.
Thomas van Dongen | Gideon Maillette de Buy Wenniger | Lambert Schomaker

Predicting the number of citations of scholarly documents is an upcoming task in scholarly document processing. Besides the intrinsic merit of this information, it also has a wider use as an imperfect proxy for quality which has the advantage of being cheaply available for large volumes of scholarly documents. Previous work has dealt with number of citations prediction with relatively small training data sets, or larger datasets but with short, incomplete input text. In this work we leverage the open access ACL Anthology collection in combination with the Semantic Scholar bibliometric database to create a large corpus of scholarly documents with associated citation information and we propose a new citation prediction model called SChuBERT. In our experiments we compare SChuBERT with several state-of-the-art citation prediction models and show that it outperforms previous methods by a large margin. We also show the merit of using more training data and longer input for number of citations prediction.

Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction
Gideon Maillette de Buy Wenniger | Thomas van Dongen | Eleri Aedmaa | Herbert Teun Kruitbosch | Edwin A. Valentijn | Lambert Schomaker

Training recurrent neural networks on long texts, in particular scholarly documents, causes problems for learning. While hierarchical attention networks (HANs) are effective in solving these problems, they still lose important information about the structure of the text. To tackle these problems, we propose the use of HANs combined with structure-tags which mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction. The proposed system is applied to the task of accept/reject prediction on the PeerRead dataset and compared against a recent BiLSTM-based model and joint textual+visual model as well as against plain HANs. Compared to plain HANs, accuracy increases on all three domains.On the computation and language domain our new model works best overall, and increases accuracy 4.7% over the best literature result. We also obtain improvements when introducing the tags for prediction of the number of citations for 88k scientific publications that we compiled from the Allen AI S2ORC dataset. For our HAN-system with structure-tags we reach 28.5% explained variance, an improvement of 1.8% over our reimplementation of the BiLSTM-based model as well as 1.0% improvement over plain HANs.

Cydex: Neural Search Infrastructure for the Scholarly Literature
Shane Ding | Edwin Zhang | Jimmy Lin

Cydex is a platform that provides neural search infrastructure for domain-specific scholarly literature. The platform represents an abstraction of Covidex, our recently developed full-stack open-source search engine for the COVID-19 Open Research Dataset (CORD-19) from AI2. While Covidex takes advantage of the latest best practices for keyword search using the popular Lucene search library as well as state-of-the-art neural ranking models using T5, parts of the system were hard coded to only work with CORD-19. This paper describes our efforts to generalize Covidex into Cydex, which can be applied to scholarly literature in different domains. By decoupling corpus-specific configurations from the frontend implementation, we are able to demonstrate the generality of Cydex on two very different corpora: the ACL Anthology and a collection of hydrology abstracts. Our platform is entirely open source and available at

On the Use of Web Search to Improve Scientific Collections
Krutarth Patel | Cornelia Caragea | Sujatha Das Gollapalli

Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly-available research paper titles and author names are used as queries to a Web search engine. We were able to obtain ~267,000 unique research papers through our fully-automated framework using ~76,000 queries, resulting in almost 200,000 more papers than the number of queries. Moreover, through a combination of title and author name search, we were able to recover 78% of the original searched titles.

Scaling Systematic Literature Reviews with Machine Learning Pipelines
Seraphina Goldfarb-Tarrant | Alexander Robertson | Jasmina Lazic | Theodora Tsouloufi | Louise Donnison | Karen Smyth

Systematic reviews, which entail the extraction of data from large numbers of scientific documents, are an ideal avenue for the application of machine learning. They are vital to many fields of science and philanthropy, but are very time-consuming and require experts. Yet the three main stages of a systematic review are easily done automatically: searching for documents can be done via APIs and scrapers, selection of relevant documents can be done via binary classification, and extraction of data can be done via sequence-labelling classification. Despite the promise of automation for this field, little research exists that examines the various ways to automate each of these tasks. We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs. We test the ability of classifiers to work well on small amounts of data and to generalise to data from countries not represented in the training data. We test different types of data extraction with varying difficulty in annotation, and five different neural architectures to do the extraction. We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation, which is only 15% of the time it takes to do the whole review manually and can be repeated and extended to new data with no additional effort.

Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
Dongyeop Kang | Andrew Head | Risham Sidhu | Kyle Lo | Daniel Weld | Marti A. Hearst

The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. Despite prior work on definition detection, current approaches are far from being accurate enough to use in realworld applications. In this paper, we first perform in-depth error analysis of the current best performing definition detection system and discover major causes of errors. Based on this analysis, we develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and heuristic filters, and evaluate it on a standard sentence-level benchmark. Because current benchmarks evaluate randomly sampled sentences, we propose an alternative evaluation that assesses every sentence within a document. This allows for evaluating recall in addition to precision. HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively. We note that performance on the high-recall document-level task is much lower than in the standard evaluation approach, due to the necessity of incorporation of document structure as features. We discuss remaining challenges in document-level definition detection, ideas for improvements, and potential issues for the development of reading aid applications.

A New Neural Search and Insights Platform for Navigating and Organizing AI Research
Marzieh Fadaee | Olga Gureenkova | Fernando Rejon Barrera | Carsten Schnober | Wouter Weerkamp | Jakub Zavrel

To provide AI researchers with modern tools for dealing with the explosive growth of the research literature in their field, we introduce a new platform, AI Research Navigator, that combines classical keyword search with neural retrieval to discover and organize relevant literature. The system provides search at multiple levels of textual granularity, from sentences to aggregations across documents, both in natural language and through navigation in a domain specific Knowledge Graph. We give an overview of the overall architecture of the system and of the components for document analysis, question answering, search, analytics, expert search, and recommendations.

Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CL-SciSumm, LaySumm and LongSumm
Muthu Kumar Chandrasekaran | Guy Feigenblat | Eduard Hovy | Abhilasha Ravichander | Michal Shmueli-Scheuer | Anita de Waard

We present the results of three Shared Tasks held at the Scholarly Document Processing Workshop at EMNLP2020: CL-SciSumm, LaySumm and LongSumm. We report on each of the tasks, which received 18 submissions in total, with some submissions addressing two or three of the tasks. In summary, the quality and quantity of the submissions show that there is ample interest in scholarly document summarization, and the state of the art in this domain is at a midway point between being an impossible task and one that is fully resolved.

CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization
Lei Li | Yang Xie | Wei Liu | Yinan Liu | Yafei Jiang | Siya Qi | Xingyuan Li

Our system participates in two shared tasks, CL-SciSumm 2020 and LongSumm 2020. In the CL-SciSumm shared task, based on our previous work, we apply more machine learning methods on position features and content features for facet classification in Task1B. And GCN is introduced in Task2 to perform extractive summarization. In the LongSumm shared task, we integrate both the extractive and abstractive summarization ways. Three methods were tested which are T5 Fine-tuning, DPPs Sampling, and GRU-GCN/GAT.

Ling Chai | Guizhen Fu | Yuan Ni

We focus on systems for TASK1 (TASK 1A and TASK 1B) of CL-SciSumm Shared Task 2020 in this paper. Task 1A is regarded as a binary classification task of sentence pairs. The strategies of domain-specific embedding and special tokens based on language models are proposed. Fusion of contextualized embedding and extra information is further explored in this article. We leverage Sembert to capture the structured semantic information. The joint of BERT-based model and classifiers without neural networks is also exploited. For the Task 1B, a language model with different weights for classes is fine-tuned to accomplish a multi-label classification task. The results show that extra information can improve the identification of cited text spans. The end-to-end trained models outperform models trained with two stages, and the averaged prediction of multi-models is more accurate than an individual one.

IIITBH-IITP@CL-SciSumm20, CL-LaySumm20, LongSumm20
Saichethan Reddy | Naveen Saini | Sriparna Saha | Pushpak Bhattacharyya

In this paper, we present the IIIT Bhagalpur and IIT Patna team’s effort to solve the three shared tasks namely, CL-SciSumm 2020, CL-LaySumm 2020, LongSumm 2020 at SDP 2020. The theme of these tasks is to generate medium-scale, lay and long summaries, respectively, for scientific articles. For the first two tasks, unsupervised systems are developed, while for the third one, we develop a supervised system.The performances of all the systems were evaluated on the associated datasets with the shared tasks in term of well-known ROUGE metric.

AUTH @ CLSciSumm 20, LaySumm 20, LongSumm 20
Alexios Gidiotis | Stefanos Stefanidis | Grigorios Tsoumakas

We present the systems we submitted for the shared tasks of the Workshop on Scholarly Document Processing at EMNLP 2020. Our approaches to the tasks are focused on exploiting large Transformer models pre-trained on huge corpora and adapting them to the different shared tasks. For tasks 1A and 1B of CL-SciSumm we are using different variants of the BERT model to tackle the tasks of “cited text span” and “facet” identification. For the summarization tasks 2 of CL-SciSumm, LaySumm and LongSumm we make use of different variants of the PEGASUS model, with and without fine-tuning, adapted to the nuances of each one of those particular tasks.

UniHD@CL-SciSumm 2020: Citation Extraction as Search
Dennis Aumiller | Satya Almasian | Philip Hausner | Michael Gertz

This work presents the entry by the team from Heidelberg University in the CL-SciSumm 2020 shared task at the Scholarly Document Processing workshop at EMNLP 2020. As in its previous iterations, the task is to highlight relevant parts in a reference paper, depending on a citance text excerpt from a citing paper. We participated in tasks 1A (citation identification) and 1B (citation context classification). Contrary to most previous works, we frame Task 1A as a search relevance problem, and introduce a 2-step re-ranking approach, which consists of a preselection based on BM25 in addition to positional document features, and a top-k re-ranking with BERT. For Task 1B, we follow previous submissions in applying methods that deal well with low resources and imbalanced classes.

IITP-AI-NLP-ML@ CL-SciSumm 2020, CL-LaySumm 2020, LongSumm 2020
Santosh Kumar Mishra | Harshavardhan Kundarapu | Naveen Saini | Sriparna Saha | Pushpak Bhattacharyya

The publication rate of scientific literature increases rapidly, which poses a challenge for researchers to keep themselves updated with new state-of-the-art. Scientific document summarization solves this problem by summarizing the essential fact and findings of the document. In the current paper, we present the participation of IITP-AI-NLP-ML team in three shared tasks, namely, CL-SciSumm 2020, LaySumm 2020, LongSumm 2020, which aims to generate medium, lay, and long summaries of the scientific articles, respectively. To solve CL-SciSumm 2020 and LongSumm 2020 tasks, three well-known clustering techniques are used, and then various sentence scoring functions, including textual entailment, are used to extract the sentences from each cluster for a summary generation. For LaySumm 2020, an encoder-decoder based deep learning model has been utilized. Performances of our developed systems are evaluated in terms of ROUGE measures on the associated datasets with the shared task.

1A-Team / Martin-Luther-Universität Halle-Wittenberg@CLSciSumm 20
Artur Jurk | Maik Boltze | Georg Keller | Lorna Ulbrich | Anja Fischer

This document demonstrates our groups approach to the CL-SciSumm shared task 2020. There are three tasks in CL-SciSumm 2020. In Task 1a, we apply a Siamese neural network to identify the spans of text in the reference paper best reflecting a citation. In Task 1b, we use a SVM to classify the facet of a citation.

Team MLU@CL-SciSumm20: Methods for Computational Linguistics Scientific Citation Linkage
Rong Huang | Kseniia Krylova

This paper describes our approach to the CL-SciSumm 2020 shared task toward the problem of identifying reference span of the citing article in the referred article. In Task 1a, we apply and compare different methods in combination with similarity scores to identify spans of the reference text for the given citance. In Task 1b, we use a logistic regression to classifying the discourse facets.

Heng Zhang | Lifan Liu | Ruping Wang | Shaohu Hu | Shutian Ma | Chengzhi Zhang

This paper mainly introduces our methods for Task 1A and Task 1B of CL-SciSumm 2020. Task 1A is to identify reference text in reference paper. Traditional machine learning models and MLP model are used. We evaluate the performances of these models and submit the final results from the optimal model. Compared with previous work, we optimize the ratio of positive to negative examples after data sampling. In order to construct features for classification, we calculate similarities between reference text and candidate sentences based on sentence vectors. Accordingly, nine similarities are used, of which eight are chosen from what we used in CL-SciSumm 2019 and a new sentence similarity based on fastText is added. Task 1B is to classify the facets of reference text. Unlike the methods used in CL-SciSumm 2019, we construct inputs of models based on word vectors and add deep learning models for classification this year.

CiteQA@CLSciSumm 2020
Anjana Umapathy | Karthik Radhakrishnan | Kinjal Jain | Rahul Singh

In academic publications, citations are used to build context for a concept by highlighting relevant aspects from reference papers. Automatically identifying referenced snippets can help researchers swiftly isolate principal contributions of scientific works. In this paper, we exploit the underlying structure of scientific articles to predict reference paper spans and facets corresponding to a citation. We propose two methods to detect citation spans - keyphrase overlap, BERT along with structural priors. We fine-tune FastText embeddings and leverage textual, positional features to predict citation facets.

Dimsum @LaySumm 20
Tiezheng Yu | Dan Su | Wenliang Dai | Pascale Fung

Lay summarization aims to generate lay summaries of scientific papers automatically. It is an essential task that can increase the relevance of science for all of society. In this paper, we build a lay summary generation system based on BART model. We leverage sentence labels as extra supervision signals to improve the performance of lay summarization. In the CL-LaySumm 2020 shared task, our model achieves 46.00 Rouge1-F1 score.

ARTU / TU Wien and Artificial Researcher@ LongSumm 20
Alaa El-Ebshihy | Annisa Maulida Ningtyas | Linda Andersson | Florina Piroi | Andreas Rauber

In this paper, we present our approach to solve the LongSumm 2020 Shared Task, at the 1st Workshop on Scholarly Document Processing. The objective of the long summaries task is to generate long summaries that cover salient information in scientific articles. The task is to generate abstractive and extractive summaries of a given scientific article. In the proposed approach, we are inspired by the concept of Argumentative Zoning (AZ) that de- fines the main rhetorical structure in scientific articles. We define two aspects that should be covered in scientific paper summary, namely Claim/Method and Conclusion/Result aspects. We use Solr index to expand the sentences of the paper abstract. We formulate each abstract sentence in a given publication as query to retrieve similar sentences from the text body of the document itself. We utilize a sentence selection algorithm described in previous literature to select sentences for the final summary that covers the two aforementioned aspects.

Monash-Summ@LongSumm 20 SciSummPip: An Unsupervised Scientific Paper Summarization Pipeline
Jiaxin Ju | Ming Liu | Longxiang Gao | Shirui Pan

The Scholarly Document Processing (SDP) workshop is to encourage more efforts on natural language understanding of scientific task. It contains three shared tasks and we participate in the LongSumm shared task. In this paper, we describe our text summarization system, SciSummPip, inspired by SummPip (Zhao et al., 2020) that is an unsupervised text summarization system for multi-document in News domain. Our SciSummPip includes a transformer-based language model SciBERT (Beltagy et al., 2019) for contextual sentence representation, content selection with PageRank (Page et al., 1999), sentence graph construction with both deep and linguistic information, sentence graph clustering and within-graph summary generation. Our work differs from previous method in that content selection and a summary length constraint is applied to adapt to the scientific domain. The experiment results on both training dataset and blind test dataset show the effectiveness of our method, and we empirically verify the robustness of modules used in SciSummPip with BERTScore (Zhang et al., 2019a).

Using Pre-Trained Transformer for Better Lay Summarization
Seungwon Kim

In this paper, we tack lay summarization tasks, which aim to automatically produce lay summaries for scientific papers, to participate in the first CL-LaySumm 2020 in SDP workshop at EMNLP 2020. We present our approach of using Pre-training with Extracted Gap-sentences for Abstractive Summarization (PEGASUS; Zhang et al., 2019b) to produce the lay summary and combining those with the extractive summarization model using Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018) and readability metrics that measure the readability of the sentence to further improve the quality of the summary. Our model achieves a remarkable performance on ROUGE metrics, demonstrating the produced summary is more readable while it summarizes the main points of the document.

Summaformers @ LaySumm 20, LongSumm 20
Sayar Ghosh Roy | Nikhil Pinnaparaju | Risubh Jain | Manish Gupta | Vasudeva Varma

Automatic text summarization has been widely studied as an important task in natural language processing. Traditionally, various feature engineering and machine learning based systems have been proposed for extractive as well as abstractive text summarization. Recently, deep learning based, specifically Transformer-based systems have been immensely popular. Summarization is a cognitively challenging task – extracting summary worthy sentences is laborious, and expressing semantics in brief when doing abstractive summarization is complicated. In this paper, we specifically look at the problem of summarizing scientific research papers from multiple domains. We differentiate between two types of summaries, namely, (a) LaySumm: A very short summary that captures the essence of the research paper in layman terms restricting overtly specific technical jargon and (b) LongSumm: A much longer detailed summary aimed at providing specific insights into various ideas touched upon in the paper. While leveraging latest Transformer-based models, our systems are simple, intuitive and based on how specific paper sections contribute to human summaries of the two types described above. Evaluations against gold standard summaries using ROUGE metrics prove the effectiveness of our approach. On blind test corpora, our system ranks first and third for the LongSumm and LaySumm tasks respectively.

Divide and Conquer: From Complexity to Simplicity for Lay Summarization
Rochana Chaturvedi | Saachi . | Jaspreet Singh Dhani | Anurag Joshi | Ankush Khanna | Neha Tomar | Swagata Duari | Alka Khurana | Vasudha Bhatnagar

We describe our approach for the 1st Computational Linguistics Lay Summary Shared Task CL-LaySumm20. The task is to produce non-technical summaries of scholarly documents. The summary should be within easy grasp of a layman who may not be well versed with the domain of the research article. We propose a two step divide-and-conquer approach. First, we judiciously select segments of the documents that are not overly pedantic and are likely to be of interest to the laity, and over-extract sentences from each segment using an unsupervised network based method. Next, we perform abstractive summarization on these extractions and systematically merge the abstractions. We run ablation studies to establish that each step in our pipeline is critical for improvement in the quality of lay summary. Our approach leverages state-of-the-art pre-trained deep neural network based models as zero-shot learners to achieve high scores on the task.

GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents
Sajad Sotudeh Gharebagh | Arman Cohan | Nazli Goharian

This paper presents our methods for the LongSumm 2020: Shared Task on Generating Long Summaries for Scientific Documents, where the task is to generatelong summaries given a set of scientific papers provided by the organizers. We explore 3 main approaches for this task: 1. An extractive approach using a BERT-based summarization model; 2. A two stage model that additionally includes an abstraction step using BART; and 3. A new multi-tasking approach on incorporating document structure into the summarizer. We found that our new multi-tasking approach outperforms the two other methods by large margins. Among 9 participants in the shared task, our best model ranks top according to Rouge-1 score (53.11%) while staying competitive in terms of Rouge-2.