Malte Ostendorff


Claim Extraction and Law Matching for COVID-19-related Legislation
Niklas Dehio | Malte Ostendorff | Georg Rehm
Proceedings of the Thirteenth Language Resources and Evaluation Conference

To cope with the COVID-19 pandemic, many jurisdictions have introduced new or altered existing legislation. Even though these new rules are often communicated to the public in news articles, it remains challenging for laypersons to learn about what is currently allowed or forbidden since news articles typically do not reference underlying laws. We investigate an automated approach to extract legal claims from news articles and to match the claims with their corresponding applicable laws. We examine the feasibility of the two tasks concerning claims about COVID-19-related laws from Berlin, Germany. For both tasks, we create and make publicly available the data sets and report the results of initial experiments. We obtain promising results with Transformer-based models that achieve 46.7 F1 for claim extraction and 91.4 F1 for law matching, albeit with some conceptual limitations. Furthermore, we discuss challenges of current machine learning approaches for legal language processing and their ability for complex legal reasoning tasks.

Generating Extended and Multilingual Summaries with Pre-trained Transformers
Rémi Calizzano | Malte Ostendorff | Qian Ruan | Georg Rehm
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Almost all summarisation methods and datasets focus on a single language and short summaries. We introduce a new dataset called WikinewsSum for English, German, French, Spanish, Portuguese, Polish, and Italian summarisation tailored for extended summaries of approx. 11 sentences. The dataset comprises 39,626 summaries which are news articles from Wikinews and their sources. We compare three multilingual transformer models on the extractive summarisation task and three training scenarios on which we fine-tune mT5 to perform abstractive summarisation. This results in strong baselines for both extractive and abstractive summarisation on WikinewsSum. We also show how the combination of an extractive model with an abstractive one can be used to create extended abstractive summaries from long input documents. Finally, our results show that fine-tuning mT5 on all the languages combined significantly improves the summarisation performance on low-resource languages.

Semantic Relations between Text Segments for Semantic Storytelling: Annotation Tool - Dataset - Evaluation
Michael Raring | Malte Ostendorff | Georg Rehm
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Semantic Storytelling describes the goal to automatically and semi-automatically generate stories based on extracted, processed, classified and annotated information from large content resources. Essential is the automated processing of text segments extracted from different content resources by identifying the relevance of a text segment to a topic and its semantic relation to other text segments. In this paper we present an approach to create an automatic classifier for semantic relations between extracted text segments from different news articles. We devise custom annotation guidelines based on various discourse structure theories and annotate a dataset of 2,501 sentence pairs extracted from 2,638 Wikinews articles. For the annotation, we developed a dedicated annotation tool. Based on the constructed dataset, we perform initial experiments with Transformer language models that are trained for the automatic classification of semantic relations. Our results with promising high accuracy scores suggest the validity and applicability of our approach for future Semantic Storytelling solutions.

HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information
Qian Ruan | Malte Ostendorff | Georg Rehm
Findings of the Association for Computational Linguistics: ACL 2022

Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model based on a pre-trained, encoder-only Transformer language model (HiStruct+ model), which improves SOTA ROUGEs for extractive summarization on PubMed and arXiv substantially. Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model outperforms a strong baseline collectively, which differs from our model only in that the hierarchical structure information is not injected. It is also observed that the more conspicuous hierarchical structure the dataset has, the larger improvements our method gains. The ablation study demonstrates that the hierarchical position information is the main contributor to our model’s SOTA performance.

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
Malte Ostendorff | Nils Rethmeier | Isabelle Augenstein | Bela Gipp | Georg Rehm
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) language models sample-efficiently and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.


DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments
Rémi Calizzano | Malte Ostendorff | Georg Rehm
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments

We present our submission to the first subtask of GermEval 2021 (classification of German Facebook comments as toxic or not). Binary sequence classification is a standard NLP task with known state-of-the-art methods. Therefore, we focus on data preparation by using two different techniques: task-specific pre-training and data augmentation. First, we pre-train multilingual transformers (XLM-RoBERTa and mT5) on 12 hate speech detection datasets in nine different languages. In terms of F1, we notice an improvement of 10% on average, using task-specific pre-training. Second, we perform data augmentation by labelling unlabelled comments, taken from Facebook, to increase the size of the training dataset by 79%. Models trained on the augmented training dataset obtain on average +0.0282 (+5%) F1 score compared to models trained on the original training dataset. Finally, the combination of the two techniques allows us to obtain an F1 score of 0.6899 with XLM-RoBERTa and 0.6859 with mT5. The code of the project is available at:

Fine-grained Classification of Political Bias in German News: A Data Set and Initial Experiments
Dmitrii Aksenov | Peter Bourgonje | Karolina Zaczynska | Malte Ostendorff | Julian Moreno-Schneider | Georg Rehm
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

We present a data set consisting of German news articles labeled for political bias on a five-point scale in a semi-supervised way. While earlier work on hyperpartisan news detection uses binary classification (i.e., hyperpartisan or not) and English data, we argue for a more fine-grained classification, covering the full political spectrum (i.e., far-left, left, centre, right, far-right) and for extending research to German data. Understanding political bias helps in accurately detecting hate speech and online abuse. We experiment with different classification methods for political bias detection. Their comparatively low performance (a macro-F1 of 43 for our best setup, compared to a macro-F1 of 79 for the binary classification task) underlines the need for more (balanced) data annotated in a fine-grained way.


Named Entities in Medical Case Reports: Corpus and Experiments
Sarah Schulz | Jurica Ševa | Samuel Rodriguez | Malte Ostendorff | Georg Rehm
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a new corpus comprising annotations of medical entities in case reports, originating from PubMed Central’s open access library. In the case reports, we annotate cases, conditions, findings, factors and negation modifiers. Moreover, where applicable, we annotate relations between these entities. As such, this is the first corpus of this kind made available to the scientific community in English. It enables the initial investigation of automatic information extraction from case reports through tasks like Named Entity Recognition, Relation Extraction and (sentence/paragraph) relevance detection. Additionally, we present four strong baseline systems for the detection of medical entities made available through the annotated dataset.

Aspect-based Document Similarity for Research Papers
Malte Ostendorff | Terry Ruas | Till Blume | Bela Gipp | Georg Rehm
Proceedings of the 28th International Conference on Computational Linguistics

Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity approach for research papers. Paper citations indicate the aspect-based similarity, i.e., the title of a section in which a citation occurs acts as a label for the pair of citing and cited paper. We apply a series of Transformer models such as RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. According to our results, SciBERT is the best performing system with F1-scores of up to 0.83. A qualitative analysis validates our quantitative results and indicates that aspect-based document similarity indeed leads to more fine-grained recommendations.