Sebastian Krause

2022

Text-editing models have recently become a prominent alternative to seq2seq models for monolingual text-generation tasks such as grammatical error correction, text simplification, and style transfer. These tasks share a common trait – they exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this observation and learn to generate the output by predicting edit operations applied to the source sequence. In contrast, seq2seq models generate outputs word-by-word from scratch thus making them slow at inference time. Text-editing models provide several benefits over seq2seq models including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of the text-edit based models and current state-of-the-art approaches analyzing their pros and cons. We discuss challenges related to deployment and how these models help to mitigate hallucination and bias, both pressing challenges in the field of text generation.

2021

pdf abs
A Simple Recipe for Multilingual Grammatical Error Correction
Sascha Rothe | Jonathan Mallinson | Eric Malmi | Sebastian Krause | Aliaksei Severyn
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

This paper presents a simple recipe to trainstate-of-the-art multilingual Grammatical Error Correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second ingredient is to use large-scale multilingual language models (up to 11B parameters). Once fine-tuned on language-specific supervised sets we surpass the previous state-of-the-art results on GEC benchmarks in four languages: English, Czech, German and Russian. Having established a new set of baselines for GEC, we make our results easily reproducible and accessible by releasing a CLANG-8 dataset. It is produced by using our best model, which we call gT5, to clean the targets of a widely used yet noisy Lang-8 dataset. cLang-8 greatly simplifies typical GEC training pipelines composed of multiple fine-tuning stages – we demonstrate that performing a single fine-tuning stepon cLang-8 with the off-the-shelf language models yields further accuracy improvements over an already top-performing gT5 model for English.

2019

pdf abs
Encode, Tag, Realize: High-Precision Text Editing
Eric Malmi | Sebastian Krause | Sascha Rothe | Daniil Mirylenka | Aliaksei Severyn
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We propose LaserTagger - a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.

2018

pdf
Automatic Prediction of Discourse Connectives
Eric Malmi | Daniele Pighin | Sebastian Krause | Mikhail Kozhevnikov
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Redundancy Localization for the Conversationalization of Unstructured Responses
Sebastian Krause | Mikhail Kozhevnikov | Eric Malmi | Daniele Pighin
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

Conversational agents offer users a natural-language interface to accomplish tasks, entertain themselves, or access information. Informational dialogue is particularly challenging in that the agent has to hold a conversation on an open topic, and to achieve a reasonable coverage it generally needs to digest and present unstructured information from textual sources. Making responses based on such sources sound natural and fit appropriately into the conversation context is a topic of ongoing research, one of the key issues of which is preventing the agent’s responses from sounding repetitive. Targeting this issue, we propose a new task, known as redundancy localization, which aims to pinpoint semantic overlap between text passages. To help address it systematically, we formalize the task, prepare a public dataset with fine-grained redundancy labels, and propose a model utilizing a weak training signal defined over the results of a passage-retrieval system on web texts. The proposed model demonstrates superior performance compared to a state-of-the-art entailment model and yields encouraging results when applied to a real-world dialogue.

pdf abs
Generating Pattern-Based Entailment Graphs for Relation Extraction
Kathrin Eichler | Feiyu Xu | Hans Uszkoreit | Sebastian Krause
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Relation extraction is the task of recognizing and extracting relations between entities or concepts in texts. A common approach is to exploit existing knowledge to learn linguistic patterns expressing the target relation and use these patterns for extracting new relation mentions. Deriving relation patterns automatically usually results in large numbers of candidates, which need to be filtered to derive a subset of patterns that reliably extract correct relation mentions. We address the pattern selection task by exploiting the knowledge represented by entailment graphs, which capture semantic relationships holding among the learned pattern candidates. This is motivated by the fact that a pattern may not express the target relation explicitly, but still be useful for extracting instances for which the relation holds, because its meaning entails the meaning of the target relation. We evaluate the usage of both automatically generated and gold-standard entailment graphs in a relation extraction scenario and present favorable experimental results, exhibiting the benefits of structuring and selecting patterns based on entailment graphs.

2016

pdf
Event Linking with Sentential Features from Convolutional Neural Networks
Sebastian Krause | Feiyu Xu | Hans Uszkoreit | Dirk Weissenborn
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

pdf abs
Creating Linked Data Morphological Language Resources with MMoOn - The Hebrew Morpheme Inventory
Bettina Klimek | Natanael Arndt | Sebastian Krause | Timotheus Arndt
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The development of standard models for describing general lexical resources has led to the emergence of numerous lexical datasets of various languages in the Semantic Web. However, equivalent models covering the linguistic domain of morphology do not exist. As a result, there are hardly any language resources of morphemic data available in RDF to date. This paper presents the creation of the Hebrew Morpheme Inventory from a manually compiled tabular dataset comprising around 52.000 entries. It is an ongoing effort of representing the lexemes, word-forms and morphologigal patterns together with their underlying relations based on the newly created Multilingual Morpheme Ontology (MMoOn). It will be shown how segmented Hebrew language data can be granularly described in a Linked Data format, thus, serving as an exemplary case for creating morpheme inventories of any inflectional language with MMoOn. The resulting dataset is described a) according to the structure of the underlying data format, b) with respect to the Hebrew language characteristic of building word-forms directly from roots, c) by exemplifying how inflectional information is realized and d) with regard to its enrichment with external links to sense resources.

pdf abs
Relation- and Phrase-level Linking of FrameNet with Sar-graphs
Aleksandra Gabryszak | Sebastian Krause | Leonhard Hennig | Feiyu Xu | Hans Uszkoreit
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Recent research shows the importance of linking linguistic knowledge resources for the creation of large-scale linguistic data. We describe our approach for combining two English resources, FrameNet and sar-graphs, and illustrate the benefits of the linked data in a relation extraction setting. While FrameNet consists of schematic representations of situations, linked to lexemes and their valency patterns, sar-graphs are knowledge resources that connect semantic relations from factual knowledge graphs to the linguistic phrases used to express instances of these relations. We analyze the conceptual similarities and differences of both resources and propose to link sar-graphs and FrameNet on the levels of relations/frames as well as phrases. The former alignment involves a manual ontology mapping step, which allows us to extend sar-graphs with new phrase patterns from FrameNet. The phrase-level linking, on the other hand, is fully automatic. We investigate the quality of the automatically constructed links and identify two main classes of errors.

pdf abs
TEG-REP: A corpus of Textual Entailment Graphs based on Relation Extraction Patterns
Kathrin Eichler | Feiyu Xu | Hans Uszkoreit | Leonhard Hennig | Sebastian Krause
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The task of relation extraction is to recognize and extract relations between entities or concepts in texts. Dependency parse trees have become a popular source for discovering extraction patterns, which encode the grammatical relations among the phrases that jointly express relation instances. State-of-the-art weakly supervised approaches to relation extraction typically extract thousands of unique patterns only potentially expressing the target relation. Among these patterns, some are semantically equivalent, but differ in their morphological, lexical-semantic or syntactic form. Some express a relation that entails the target relation. We propose a new approach to structuring extraction patterns by utilizing entailment graphs, hierarchical structures representing entailment relations, and present a novel resource of gold-standard entailment graphs based on a set of patterns automatically acquired using distant supervision. We describe the methodology used for creating the dataset and present statistics of the resource as well as an analysis of inference types underlying the entailment decisions.

2015

pdf
Sar-graphs: A Linked Linguistic Knowledge Resource Connecting Facts with Language
Sebastian Krause | Leonhard Hennig | Aleksandra Gabryszak | Feiyu Xu | Hans Uszkoreit
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

pdf
Semi-automatic Generation of Multiple-Choice Tests from Mentions of Semantic Relations
Renlong Ai | Sebastian Krause | Walter Kasper | Feiyu Xu | Hans Uszkoreit
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

pdf
A Web-based Collaborative Evaluation Tool for Automatically Learned Relation Extraction Patterns
Leonhard Hennig | Hong Li | Sebastian Krause | Feiyu Xu | Hans Uszkoreit
Proceedings of ACL-IJCNLP 2015 System Demonstrations

pdf
Idest: Learning a Distributed Representation for Event Patterns
Sebastian Krause | Enrique Alfonseca | Katja Filippova | Daniele Pighin
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

This paper presents a new resource for the training and evaluation needed by relation extraction experiments. The corpus consists of annotations of mentions for three semantic relations: marriage, parent―child, siblings, selected from the domain of biographic facts about persons and their social relationships. The corpus contains more than one hundred news articles from Tabloid Press. In the current corpus, we only consider the relation mentions occurring in the individual sentences. We provide multi-level annotations which specify the marked facts from relation, argument, entity, down to the token level, thus allowing for detailed analysis of linguistic phenomena and their interactions. A generic markup tool Recon developed at the DFKI LT lab has been utilised for the annotation task. The corpus has been annotated by two human experts, supported by additional conflict resolution conducted by a third expert. As shown in the evaluation, the annotation is of high quality as proved by the stated inter-annotator agreements both on sentence level and on relationmention level. The current corpus is already in active use in our research for evaluation of the relation extraction performance of our automatically learned extraction patterns.

pdf abs
Language Resources and Annotation Tools for Cross-Sentence Relation Extraction
Sebastian Krause | Hong Li | Feiyu Xu | Hans Uszkoreit | Robert Hummel | Luise Spielhagen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present a novel combination of two types of language resources dedicated to the detection of relevant relations (RE) such as events or facts across sentence boundaries. One of the two resources is the sar-graph, which aggregates for each target relation ten thousands of linguistic patterns of semantically associated relations that signal instances of the target relation (Uszkoreit and Xu, 2013). These have been learned from the Web by intra-sentence pattern extraction (Krause et al., 2012) and after semantic filtering and enriching have been automatically combined into a single graph. The other resource is cockrACE, a specially annotated corpus for the training and evaluation of cross-sentence RE. By employing our powerful annotation tool Recon, annotators mark selected entities and relations (including events), coreference relations among these entities and events, and also terms that are semantically related to the relevant relations and events. This paper describes how the two resources are created and how they complement each other.

Co-authors

Venues

ldl1