Iryna Gurevych


2021

Investigating Pretrained Language Models for Graph-to-Text Generation
Leonardo F. R. Ribeiro | Martin Schmitt | Hinrich Schütze | Iryna Gurevych
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Graph-to-text generation aims to generate fluent texts from graph-based data. In this paper, we investigate two recent pretrained language models (PLMs) and analyze the impact of different task-adaptive pretraining strategies for PLMs in graph-to-text generation. We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs. We show that approaches based on PLMs BART and T5 achieve new state-of-the-art results and that task-adaptive pretraining strategies improve their performance even further. We report new state-of-the-art BLEU scores of 49.72 on AMR-LDC2017T10, 59.70 on WebNLG, and 25.66 on AGENDA datasets - a relative improvement of 31.8%, 4.5%, and 42.4%, respectively, with our models generating significantly more fluent texts than human references. In an extensive analysis, we identify possible reasons for the PLMs’ success on graph-to-text tasks. Our findings suggest that the PLMs benefit from similar facts seen during pretraining or fine-tuning, such that they perform well even when the input graph is reduced to a simple bag of node and edge labels.

Exploring Metaphoric Paraphrase Generation
Kevin Stowe | Nils Beck | Iryna Gurevych
Proceedings of the 25th Conference on Computational Natural Language Learning

Metaphor generation is a difficult task, and has seen tremendous improvement with the advent of deep pretrained models. We focus here on the specific task of metaphoric paraphrase generation, in which we provide a literal sentence and generate a metaphoric sentence which paraphrases that input. We compare naive, “free” generation models with those that exploit forms of control over the generation process, adding additional information based on conceptual metaphor theory. We evaluate two methods for generating paired training data, which is then used to train T5 models for free and controlled generation. We use crowdsourcing to evaluate the results, showing that free models tend to generate more fluent paraphrases, while controlled models are better at generating novel metaphors. We then analyze evaluation metrics, showing that different metrics are necessary to capture different aspects of metaphoric paraphrasing. We release our data and models, as well as our annotated results in order to facilitate development of better evaluation metrics.

Modeling Graph Structure via Relative Position for Text Generation from Knowledge Graphs
Martin Schmitt | Leonardo F. R. Ribeiro | Philipp Dufter | Iryna Gurevych | Hinrich Schütze
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

We present Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation. With our novel graph self-attention, the encoding of a node relies on all nodes in the input graph - not only direct neighbors - facilitating the detection of global patterns. We represent the relation between two nodes as the length of the shortest path between them. Graformer learns to weight these node-node relations differently for different attention heads, thus virtually learning differently connected views of the input graph. We evaluate Graformer on two popular graph-to-text generation benchmarks, AGENDA and WebNLG, where it achieves strong performance while using many fewer parameters than other approaches.
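
The node-node relation mentioned in this abstract, the length of the shortest path between two nodes, can be illustrated with a small sketch. This is not the Graformer implementation: the toy graph, the function name `shortest_path_lengths`, and the choice to treat edges as undirected are all assumptions made for illustration only.

```python
from collections import deque


def shortest_path_lengths(adjacency):
    """Return {source: {target: hops}} for every reachable node pair, using BFS."""
    lengths = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    # First visit in BFS order gives the minimal hop count.
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        lengths[source] = dist
    return lengths


# Hypothetical toy graph (a simple chain A - B - C - D), stored as undirected adjacency lists.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

relations = shortest_path_lengths(graph)
print(relations["A"]["D"])  # 3
```

In a model of this kind, such pairwise distances would plausibly be mapped to learned attention biases; how Graformer does so is described in the paper, not here.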

AdapterFusion: Non-Destructive Task Composition for Transfer Learning
Jonas Pfeiffer | Aishwarya Kamath | Andreas Rücklé | Kyunghyun Cho | Iryna Gurevych
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Sequential fine-tuning and multi-task learning are methods aiming to incorporate knowledge from multiple tasks; however, they suffer from catastrophic forgetting and difficulties in dataset balancing. To address these shortcomings, we propose AdapterFusion, a new two stage learning algorithm that leverages knowledge from multiple tasks. First, in the knowledge extraction stage we learn task specific parameters called adapters, that encapsulate the task-specific information. We then combine the adapters in a separate knowledge composition step. We show that by separating the two stages, i.e., knowledge extraction and knowledge composition, the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner. We empirically evaluate AdapterFusion on 16 diverse NLU tasks, and find that it effectively combines various types of knowledge at different layers of the model. We show that our approach outperforms traditional strategies such as full fine-tuning as well as multi-task learning. Our code and adapters are available at AdapterHub.ml.

Improving Factual Consistency Between a Response and Persona Facts
Mohsen Mesgar | Edwin Simpson | Iryna Gurevych
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Neural models for response generation produce responses that are semantically plausible but not necessarily factually consistent with facts describing the speaker’s persona. These models are trained with fully supervised learning where the objective function barely captures factual consistency. We propose to fine-tune these models by reinforcement learning and an efficient reward function that explicitly captures the consistency between a response and persona facts as well as semantic plausibility. Our automatic and human evaluations on the PersonaChat corpus confirm that our approach increases the rate of responses that are factually consistent with persona facts over its supervised counterpart while retaining the language quality of responses.

TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
Kexin Wang | Nils Reimers | Iryna Gurevych
Findings of the Association for Computational Linguistics: EMNLP 2021

Learning sentence embeddings often requires a large amount of labeled data. However, for most tasks and domains, labeled data is seldom available and creating it is expensive. In this work, we present a new state-of-the-art unsupervised method based on pre-trained Transformers and Sequential Denoising Auto-Encoder (TSDAE) which outperforms previous approaches by up to 6.4 points. It can achieve up to 93.1% of the performance of in-domain supervised approaches. Further, we show that TSDAE is a strong domain adaptation and pre-training method for sentence embeddings, significantly outperforming other approaches like Masked Language Model. A crucial shortcoming of previous studies is the narrow evaluation: Most work mainly evaluates on the single task of Semantic Textual Similarity (STS), which does not require any domain knowledge. It is unclear if these proposed methods generalize to other domains and tasks. We fill this gap and evaluate TSDAE and other recent approaches on four different datasets from heterogeneous domains.

A Neural Graph-based Local Coherence Model
Mohsen Mesgar | Leonardo F. R. Ribeiro | Iryna Gurevych
Findings of the Association for Computational Linguistics: EMNLP 2021

Entity grids and entity graphs are two frameworks for modeling local coherence. These frameworks represent entity relations between sentences and then extract features from such representations to encode coherence. The benefits of convolutional neural models for extracting informative features from entity grids have been recently studied. In this work, we study the benefits of Relational Graph Convolutional Networks (RGCN) to encode entity graphs for measuring local coherence. We evaluate our neural graph-based model for two benchmark coherence evaluation tasks: sentence ordering (SO) and summary coherence rating (SCR). The results show that our neural graph-based model consistently outperforms the neural grid-based model for both tasks. Our model performs competitively with a strong baseline coherence model, while our model uses 50% fewer parameters. Our work defines a new, efficient, and effective baseline for local coherence modeling.

Proceedings of the First Workshop on Interactive Learning for Natural Language Processing
Kianté Brantley | Soham Dan | Iryna Gurevych | Ji-Ung Lee | Filip Radlinski | Hinrich Schütze | Edwin Simpson | Lili Yu
Proceedings of the First Workshop on Interactive Learning for Natural Language Processing

Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora
Michael Bugert | Nils Reimers | Iryna Gurevych
Computational Linguistics, Volume 47, Issue 3 - November 2021

Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents. CDCR aims to benefit downstream multidocument applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet. We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus. This raises strong concerns on their generalizability, a must-have for downstream applications where the magnitude of domains or event mentions is likely to exceed those found in a curated corpus. To investigate this assumption, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus, and the Football Coreference Corpus (which we reannotate on token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Although being inferior in absolute numbers, the feature-based system shows more consistent performance across all corpora whereas the neural system is hit-or-miss. Via model introspection, we find that the importance of event actions, event time, and so forth, for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit on the structure of the ECB+ corpus. We conclude with recommendations on how to achieve generally applicable CDCR systems in the future, the most important being that evaluation on multiple CDCR corpora is strongly necessary. To facilitate future research, we release our dataset, annotation guidelines, and system implementation to the public.

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
Nandan Thakur | Nils Reimers | Johannes Daxenberger | Iryna Gurevych
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

There are two approaches for pairwise sentence scoring: Cross-encoders, which perform full-attention over the input pair, and Bi-encoders, which map each input independently to a dense vector space. While cross-encoders often achieve higher performance, they are too slow for many practical use cases. Bi-encoders, on the other hand, require substantial training data and fine-tuning over the target task to achieve competitive performance. We present a simple yet efficient data augmentation strategy called Augmented SBERT, where we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder. We show that, in this process, selecting the sentence pairs is non-trivial and crucial for the success of the method. We evaluate our approach on multiple tasks (in-domain) as well as on a domain adaptation task. Augmented SBERT achieves an improvement of up to 6 points for in-domain and of up to 37 points for domain adaptation tasks compared to the original bi-encoder performance.
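
The augmentation step described in this abstract, where a cross-encoder labels additional input pairs that are then added to the bi-encoder's training data, can be sketched roughly as follows. Everything named here is hypothetical: `toy_cross_encoder` is a word-overlap stand-in rather than a trained model, and the sentence pairs are invented. The pair-selection step that the abstract calls crucial is deliberately left out, so the unlabeled pairs are simply taken as given.

```python
def toy_cross_encoder(sent_a, sent_b):
    """Hypothetical stand-in for a trained cross-encoder: Jaccard word overlap in [0, 1]."""
    words_a = set(sent_a.lower().split())
    words_b = set(sent_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def augment(gold_pairs, unlabeled_pairs, scorer):
    """Label unlabeled pairs with the scorer (silver data) and append them to the gold data."""
    silver_pairs = [(a, b, scorer(a, b)) for a, b in unlabeled_pairs]
    return gold_pairs + silver_pairs


# Invented example data: one gold-labeled pair and two unlabeled pairs.
gold = [("a cat sits", "a cat is sitting", 0.9)]
unlabeled = [
    ("the dog runs", "the dog runs fast"),
    ("red car", "blue sky"),
]

# The combined gold + silver set is what a bi-encoder would then be trained on.
train_set = augment(gold, unlabeled, toy_cross_encoder)
for sent_a, sent_b, score in train_set:
    print(f"{score:.2f}  {sent_a!r} / {sent_b!r}")
```

In practice the scorer would be a cross-encoder fine-tuned on the gold data, and the choice of which pairs to label would need the care the abstract points to.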

Aspect-Controlled Neural Argument Generation
Benjamin Schiller | Johannes Daxenberger | Iryna Gurevych
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We rely on arguments in our daily lives to deliver our opinions and base them on evidence, making them more convincing in turn. However, finding and formulating arguments can be challenging. In this work, we present the Arg-CTRL - a language model for argument generation that can be controlled to generate sentence-level arguments for a given topic, stance, and aspect. We define argument aspect detection as a necessary method to allow this fine-granular control and crowdsource a dataset with 5,032 arguments annotated with aspects. Our evaluation shows that the Arg-CTRL is able to generate high-quality, aspect-specific arguments, applicable to automatic counter-argument generation. We publish the model weights and all datasets and code to train the Arg-CTRL.

Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing
Nafise Sadat Moosavi | Iryna Gurevych | Angela Fan | Thomas Wolf | Yufang Hou | Ana Marasović | Sujith Ravi
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing

Investigating label suggestions for opinion mining in German Covid-19 social media
Tilman Beck | Ji-Ung Lee | Christina Viehmann | Marcus Maurer | Oliver Quiring | Iryna Gurevych
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This work investigates the use of interactively updated label suggestions to improve upon the efficiency of gathering annotations on the task of opinion mining in German Covid-19 social media data. We develop guidelines to conduct a controlled annotation study with social science students and find that suggestions from a model trained on a small, expert-annotated dataset already lead to a substantial improvement – in terms of inter-annotator agreement (+.14 Fleiss’ κ) and annotation quality – compared to students that do not receive any label suggestions. We further find that label suggestions from interactively trained models do not lead to an improvement over suggestions from a static model. Nonetheless, our analysis of suggestion bias shows that annotators remain capable of reflecting upon the suggested label in general. Finally, we confirm the quality of the annotated data in transfer learning experiments between different annotator groups. To facilitate further research in opinion mining on social media data, we release our collected data consisting of 200 expert and 2,785 student annotations.

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust | Jonas Pfeiffer | Ivan Vulić | Sebastian Ruder | Iryna Gurevych
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.

Coreference Reasoning in Machine Reading Comprehension
Mingzhu Wu | Nafise Sadat Moosavi | Dan Roth | Iryna Gurevych
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Coreference resolution is essential for natural language understanding and has been long studied in NLP. In recent years, as the format of Question Answering (QA) became a standard for machine reading comprehension (MRC), there have been data collection efforts, e.g., Dasigi et al. (2019), that attempt to evaluate the ability of MRC models to reason about coreference. However, as we show, coreference reasoning in MRC is a greater challenge than earlier thought; MRC datasets do not reflect the natural distribution and, consequently, the challenges of coreference reasoning. Specifically, success on these datasets does not reflect a model’s proficiency in coreference reasoning. We propose a methodology for creating MRC datasets that better reflect the challenges of coreference reasoning and use it to create a sample evaluation set. The results on our dataset show that state-of-the-art models still struggle with these phenomena. Furthermore, we develop an effective way to use naturally occurring coreference phenomena from existing coreference resolution datasets when training MRC models. This allows us to show an improvement in the coreference reasoning abilities of state-of-the-art models.

Metaphor Generation with Conceptual Mappings
Kevin Stowe | Tuhin Chakrabarty | Nanyun Peng | Smaranda Muresan | Iryna Gurevych
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Generating metaphors is a difficult task as it requires understanding nuanced relationships between abstract concepts. In this paper, we aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs. Guided by conceptual metaphor theory, we propose to control the generation process by encoding conceptual mappings between cognitive domains to generate meaningful metaphoric expressions. To achieve this, we develop two methods: 1) using FrameNet-based embeddings to learn mappings between domains and applying them at the lexical level (CM-Lex), and 2) deriving source/target pairs to train a controlled seq-to-seq generation model (CM-BART). We assess our methods through automatic and human evaluation for basic metaphoricity and conceptual metaphor presence. We show that the unsupervised CM-Lex model is competitive with recent deep learning metaphor generation systems, and CM-BART outperforms all other models both in automatic and human evaluations.

The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
Nils Reimers | Iryna Gurevych
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Information Retrieval using dense low-dimensional representations recently became popular and showed out-performance to traditional sparse-representations like BM25. However, no previous work investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance for dense representations decreases quicker than sparse representations for increasing index sizes. In extreme cases, this can even lead to a tipping point where at a certain index size sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: The lower the dimension, the higher the chance for false positives, i.e. returning irrelevant documents.

Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News
Michael Bugert | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Cross-document event coreference resolution (CDCR) is the task of identifying which event mentions refer to the same events throughout a collection of documents. Annotating CDCR data is an arduous and expensive process, explaining why existing corpora are small and lack domain coverage. To overcome this bottleneck, we automatically extract event coreference data from hyperlinks in online news: When referring to a significant real-world event, writers often add a hyperlink to another article covering this event. We demonstrate that collecting hyperlinks which point to the same article(s) produces extensive and high-quality CDCR data and create a corpus of 2M documents and 2.7M silver-standard event mentions called HyperCoref. We evaluate a state-of-the-art system on three CDCR corpora and find that models trained on small subsets of HyperCoref are highly competitive, with performance similar to models trained on gold-standard data. With our work, we free CDCR research from depending on costly human-annotated training data and open up possibilities for research beyond English CDCR, as our data extraction approach can be easily adapted to other languages.

Smelting Gold and Silver for Improved Multilingual AMR-to-Text Generation
Leonardo F. R. Ribeiro | Jonas Pfeiffer | Yue Zhang | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent work on multilingual AMR-to-text generation has exclusively focused on data augmentation strategies that utilize silver AMR. However, this assumes a high quality of generated AMRs, potentially limiting the transferability to the target task. In this paper, we investigate different techniques for automatically generating AMR annotations, where we aim to study which source of information yields better multilingual results. Our models trained on gold AMR with silver (machine translated) sentences outperform approaches which leverage generated silver AMR. We find that combining both complementary sources of information further improves multilingual AMR-to-text generation. Our models surpass the previous state of the art for German, Italian, Spanish, and Chinese by a large margin.

Structural Adapters in Pretrained Language Models for AMR-to-Text Generation
Leonardo F. R. Ribeiro | Yue Zhang | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pretrained language models (PLM) have recently advanced graph-to-text generation, where the input graph is linearized into a sequence and fed into the PLM to obtain its representation. However, efficiently encoding the graph structure in PLMs is challenging because such models were pretrained on natural language, and modeling structured data may lead to catastrophic forgetting of distributional knowledge. In this paper, we propose StructAdapt, an adapter method to encode graph structure into PLMs. Contrary to prior work, StructAdapt effectively models interactions among the nodes based on the graph connectivity, only training graph structure-aware adapter parameters. In this way, we incorporate task-specific knowledge while maintaining the topological structure of the graph. We empirically show the benefits of explicitly encoding graph structure into PLMs using StructAdapt, outperforming the state of the art on two AMR-to-text datasets, training only 5.1% of the PLM parameters.

AdapterDrop: On the Efficiency of Adapters in Transformers
Andreas Rücklé | Gregor Geigle | Max Glockner | Tilman Beck | Jonas Pfeiffer | Nils Reimers | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Transformer models are expensive to fine-tune, slow for inference, and have large storage requirements. Recent approaches tackle these shortcomings by training smaller models, dynamically reducing the model size, and by training light-weight adapters. In this paper, we propose AdapterDrop, removing adapters from lower transformer layers during training and inference, which incorporates concepts from all three directions. We show that AdapterDrop can dynamically reduce the computational overhead when performing inference over multiple tasks simultaneously, with minimal decrease in task performances. We further prune adapters from AdapterFusion, which improves the inference efficiency while maintaining the task performances entirely.

Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning
Prasetya Utama | Nafise Sadat Moosavi | Victor Sanh | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent prompt-based approaches allow pretrained language models to achieve strong performances on few-shot finetuning by reformulating downstream tasks as a language modeling problem. In this work, we demonstrate that, despite its advantages on low data regimes, finetuned prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap, e.g., models incorrectly assuming a sentence pair is of the same meaning because they consist of the same set of words. Interestingly, we find that this particular inference heuristic is significantly less present in the zero-shot evaluation of the prompt-based model, indicating how finetuning can be destructive to useful knowledge learned during the pretraining. We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning. Our evaluation on three datasets demonstrates promising improvements on the three corresponding challenge datasets used to diagnose the inference heuristics.

UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
Jonas Pfeiffer | Ivan Vulić | Iryna Gurevych | Sebastian Ruder
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model’s embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT’s and target language vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.

What to Pre-Train on? Efficient Intermediate Task Selection
Clifton Poth | Jonas Pfeiffer | Andreas Rücklé | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Intermediate task fine-tuning has been shown to culminate in large transfer gains across many NLP tasks. With an abundance of candidate datasets as well as pre-trained language models, it has become infeasible to experiment with all combinations to find the best transfer setting. In this work, we provide a comprehensive comparison of different methods for efficiently identifying beneficial tasks for intermediate transfer learning. We focus on parameter and computationally efficient adapter settings, highlight different data-availability scenarios, and provide expense estimates for each method. We experiment with a diverse set of 42 intermediate and 11 target English classification, multiple choice, question answering, and sequence tagging tasks. Our results demonstrate that efficient embedding based methods, which rely solely on the respective datasets, outperform computationally expensive few-shot fine-tuning approaches. Our best methods achieve an average Regret@3 of 1% across all target tasks, demonstrating that we are able to efficiently identify the best datasets for intermediate training.

Human-In-The-Loop Entity Linking for Low Resource Domains
Jan-Christoph Klie | Richard Eckart de Castilho | Iryna Gurevych
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Entity linking (EL) is concerned with disambiguating entity mentions in a text against knowledge bases (KB). To quickly annotate texts with EL even in low-resource domains and noisy text, we present a novel Human-In-The-Loop EL approach. We show that it greatly outperforms a strong baseline in simulation. In a user study, annotation time is reduced by 35% compared to annotating without interactive support; users report that they strongly prefer our system over ones without. An open-source and ready-to-use implementation based on the text annotation platform is made available.

2020

Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers
Anne Lauscher | Olga Majewska | Leonardo F. R. Ribeiro | Iryna Gurevych | Nikolai Rozanov | Goran Glavaš
Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work focused on injecting (structured) knowledge from external resources into these models. While on the one hand, joint pre-training (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge, on the other hand, may lead to the catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS. We also open source all our experiments and relevant code under: https://github.com/wluper/retrograph.

Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics
Iryna Gurevych | Marianna Apidianaki | Manaal Faruqui
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

A matter of framing: The impact of linguistic formalism on probing results
Ilia Kuznetsov | Iryna Gurevych
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Deep pre-trained contextualized encoders like BERT demonstrate remarkable performance on a range of downstream tasks. A recent line of research in probing investigates the linguistic knowledge implicitly learned by these models during pre-training. While most work in probing operates on the task level, linguistic tasks are rarely uniform and can be represented in a variety of formalisms. Any linguistics-based probing study thereby inevitably commits to the formalism used to annotate the underlying data. Can the choice of formalism affect probing results? To investigate, we conduct an in-depth cross-formalism layer probing study in role semantics. We find linguistically meaningful differences in the encoding of semantic role- and proto-role information by BERT depending on the formalism and demonstrate that layer probing can detect subtle differences between the implementations of the same linguistic formalism. Our results suggest that linguistic formalism is an important dimension in probing studies, along with the commonly used cross-task and cross-lingual experimental settings.

pdf bib
MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale
Andreas Rücklé | Jonas Pfeiffer | Iryna Gurevych
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We study the zero-shot transfer capabilities of text matching models on a massive scale, by self-supervised training on 140 source domains from community question answering forums in English. We investigate the model performances on nine benchmarks of answer selection and question similarity tasks, and show that all 140 models transfer surprisingly well, where the large majority of models substantially outperforms common IR baselines. We also demonstrate that considering a broad selection of source domains is crucial for obtaining the best zero-shot transfer performances, which contrasts the standard procedure that merely relies on the largest and most similar domains. In addition, we extensively study how to best combine multiple source domains. We propose to incorporate self-supervised with supervised multi-task learning on all available source domains. Our best zero-shot transfer model considerably outperforms in-domain BERT and the previous state of the art on six benchmarks. Fine-tuning of our model with in-domain data results in additional large gains and achieves the new state of the art on all nine benchmarks.

pdf bib
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Nils Reimers | Iryna Gurevych
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present an easy and efficient method to extend existing sentence embedding models to new languages. This makes it possible to create multilingual versions from previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: It is easy to extend existing models with relatively few samples to new languages, it is easier to ensure desired properties for the vector space, and the hardware requirements for training are lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embedding models to more than 400 languages is publicly available.
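The training idea in this abstract (a student model mimicking the original model's embedding of the source sentence) can be illustrated with a minimal sketch. The abstract does not state the loss, so the mean squared error and the choice to match the student on both the source sentence and its translation are assumptions made here for illustration; the function names and toy embeddings are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally shaped batches of vectors."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Pull the student's embeddings of a sentence and of its translation
    toward the teacher's embedding of the source sentence (assumed objective)."""
    return mse(student_src, teacher_src) + mse(student_tgt, teacher_src)


# Toy 2-dimensional embeddings for a batch of two sentence pairs (made-up values).
teacher = [[1.0, 0.0], [0.0, 1.0]]

# A student that already reproduces the teacher on both languages has zero loss.
perfect = distillation_loss(teacher, teacher, teacher)

# A student whose first translated sentence is embedded at the origin is penalized.
off = distillation_loss(teacher, teacher, [[0.0, 0.0], [0.0, 1.0]])
```

In this sketch the teacher's embeddings are fixed targets, so only the student would be updated during training.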

pdf bib
Towards Debiasing NLU Models from Unknown Biases
Prasetya Ajie Utama | Nafise Sadat Moosavi | Iryna Gurevych
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

NLU models often exploit biases to achieve high dataset-specific performance without properly learning the intended task. Recently proposed debiasing methods are shown to be effective in mitigating this tendency. However, these methods rely on a major assumption that the types of bias should be known a-priori, which limits their application to many NLU tasks and datasets. In this work, we present the first step to bridge this gap by introducing a self-debiasing framework that prevents models from mainly utilizing biases without knowing them in advance. The proposed framework is general and complementary to the existing debiasing methods. We show that it allows these existing methods to retain the improvement on the challenge datasets (i.e., sets of examples designed to expose models’ reliance on biases) without specifically targeting certain biases. Furthermore, the evaluation suggests that applying the framework results in improved overall robustness.

pdf bib
MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer
Jonas Pfeiffer | Ivan Vulić | Iryna Gurevych | Sebastian Ruder
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The main goal behind state-of-the-art pre-trained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping NLP applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. However, due to limited model capacity, their transfer performance is the weakest exactly on such low-resource languages and languages unseen during pre-training. We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations. In addition, we introduce a novel invertible adapter architecture and a strong baseline method for adapting a pre-trained multilingual model to a new language. MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and causal commonsense reasoning, and achieves competitive results on question answering. Our code and adapters are available at AdapterHub.ml.

pdf bib
AdapterHub: A Framework for Adapting Transformers
Jonas Pfeiffer | Andreas Rücklé | Clifton Poth | Aishwarya Kamath | Ivan Vulić | Sebastian Ruder | Kyunghyun Cho | Iryna Gurevych
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

The current modus operandi in NLP involves downloading and fine-tuning pre-trained models consisting of millions or billions of parameters. Storing and sharing such large trained models is expensive, slow, and time-consuming, which impedes progress towards more general and versatile NLP methods that learn from and for many tasks. Adapters (small learnt bottleneck layers inserted within each layer of a pre-trained model) ameliorate this issue by avoiding full fine-tuning of the entire model. However, sharing and integrating adapter layers is not straightforward. We propose AdapterHub, a framework that allows dynamic “stitching-in” of pre-trained adapters for different tasks and languages. The framework, built on top of the popular HuggingFace Transformers library, enables extremely easy and quick adaptations of state-of-the-art pre-trained models (e.g., BERT, RoBERTa, XLM-R) across tasks and languages. Downloading, sharing, and training adapters is as seamless as possible using minimal changes to the training scripts and a specialized infrastructure. Our framework enables scalable and easy access to sharing of task-specific models, particularly in low-resource scenarios. AdapterHub includes all recent adapter architectures and can be found at AdapterHub.ml.
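For readers unfamiliar with the term, a bottleneck adapter can be sketched in a few lines. This is an illustrative toy and not the AdapterHub API: the ReLU nonlinearity, the residual connection, and all names and weights below are assumptions based on common adapter designs, since the abstract only describes adapters as small learnt bottleneck layers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter(hidden, w_down, w_up):
    """Project down to a small bottleneck, apply ReLU, project back up,
    and add the result to the input (assumed residual connection)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


# Hidden size 4, bottleneck size 1 (hypothetical toy weights).
hidden = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.5, 0.0, 0.0, 0.0]]        # shape 1 x 4
w_up = [[1.0], [0.0], [0.0], [2.0]]    # shape 4 x 1

out = adapter(hidden, w_down, w_up)

# With a zero up-projection the adapter leaves the hidden state unchanged.
unchanged = adapter(hidden, w_down, [[0.0], [0.0], [0.0], [0.0]])
```

Only `w_down` and `w_up` would be trained in such a setup, which is why storing an adapter is far cheaper than storing a fully fine-tuned model.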

pdf bib
LINSPECTOR: Multilingual Probing Tasks for Word Representations
Gözde Gül Şahin | Clara Vania | Ilia Kuznetsov | Iryna Gurevych
Computational Linguistics, Volume 46, Issue 2 - June 2020

Despite an ever-growing number of word representation models introduced for a large number of languages, there is a lack of a standardized technique to provide insights into what is captured by these models. Such insights would help the community to get an estimate of the downstream task performance, as well as to design more informed neural architectures, while avoiding extensive experimentation that requires substantial computational resources not all researchers have access to. A recent development in NLP is to use simple classification tasks, also called probing tasks, that test for a single linguistic feature such as part-of-speech. Existing studies mostly focus on exploring the linguistic information encoded by the continuous representations of English text. However, from a typological perspective the morphologically poor English is rather an outlier: The information encoded by the word order and function words in English is often stored on a subword, morphological level in other languages. To address this, we introduce 15 type-level probing tasks such as case marking, possession, word length, morphological tag count, and pseudoword identification for 24 languages. We present a reusable methodology for creation and evaluation of such tests in a multilingual setting, which is challenging because of a lack of resources, lower quality of tools, and differences among languages. We then present experiments on several diverse multilingual word embedding models, in which we relate the probing task performance for a diverse set of languages to a range of five classic NLP tasks: POS-tagging, dependency parsing, semantic role labeling, named entity recognition, and natural language inference. We find that a number of probing tests have significantly high positive correlation to the downstream tasks, especially for morphologically rich languages. 
We show that our tests can be used to explore word embeddings or black-box neural models for linguistic cues in a multilingual setting. We release the probing data sets and the evaluation suite LINSPECTOR with https://github.com/UKPLab/linspector.

pdf bib
Improving QA Generalization by Concurrent Modeling of Multiple Biases
Mingzhu Wu | Nafise Sadat Moosavi | Andreas Rücklé | Iryna Gurevych
Findings of the Association for Computational Linguistics: EMNLP 2020

Existing NLP datasets contain various biases that models can easily exploit to achieve high performances on the corresponding evaluation sets. However, focusing on dataset-specific biases limits their ability to learn more generalizable knowledge about the task from more general data patterns. In this paper, we investigate the impact of debiasing methods for improving generalization and propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrent modeling of multiple biases in the training data. Our framework weights each example based on the biases it contains and the strength of those biases in the training data. It then uses these weights in the training objective so that the model relies less on examples with high bias weights. We extensively evaluate our framework on extractive question answering with training data from various domains with multiple biases of different strengths. We perform the evaluations in two different settings, in which the model is trained on a single domain or multiple domains simultaneously, and show its effectiveness in both settings compared to state-of-the-art debiasing methods.
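The weighting scheme described in this abstract can be sketched roughly as follows. The abstract does not give formulas, so the way bias strengths are combined (a capped sum) and the way the weight enters the objective (scaling each example's loss by one minus its bias weight) are assumptions made for illustration; all names and numbers are hypothetical.

```python
def bias_weight(contains, strengths):
    """Combine the strengths of the biases an example contains (capped at 1.0)."""
    total = sum(s for has_bias, s in zip(contains, strengths) if has_bias)
    return min(1.0, total)


def weighted_loss(losses, weights):
    """Average per-example losses, down-weighting examples with high bias weights."""
    return sum((1.0 - w) * l for l, w in zip(losses, weights)) / len(losses)


# Two hypothetical biases with made-up strengths in the training data.
strengths = [0.5, 0.25]

# Which biases each of three training examples contains.
examples = [[True, True], [False, True], [False, False]]

weights = [bias_weight(e, strengths) for e in examples]
loss = weighted_loss([1.0, 1.0, 1.0], weights)
```

Here the first example contains both biases and contributes least to the objective, while the bias-free third example contributes its full loss.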

pdf bib
Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision
Max Glockner | Ivan Habernal | Iryna Gurevych
Findings of the Association for Computational Linguistics: EMNLP 2020

Evaluating the trustworthiness of a model’s prediction is essential for differentiating between ‘right for the right reasons’ and ‘right for the wrong reasons’. Identifying textual spans that determine the target label, known as faithful rationales, usually relies on pipeline approaches or reinforcement learning. However, such methods either require supervision and thus costly annotation of the rationales or employ non-differentiable models. We propose a differentiable training-framework to create models which output faithful rationales on a sentence level, by solely applying supervision on the target task. To achieve this, our model solves the task based on each rationale individually and learns to assign high scores to those which solved the task best. Our evaluation on three different datasets shows competitive results compared to a standard BERT blackbox while exceeding a pipeline counterpart’s performance in two cases. We further exploit the transparent decision-making process of these models to prefer selecting the correct rationales by applying direct supervision, thereby boosting the performance on the rationale-level.

pdf bib
Modeling Global and Local Node Contexts for Text Generation from Knowledge Graphs
Leonardo F. R. Ribeiro | Yue Zhang | Claire Gardent | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 8

Recent graph-to-text models generate text from graph-based data using either global or local aggregation to learn node representations. Global node encoding allows explicit communication between two distant nodes, thereby neglecting graph topology as all nodes are directly connected. In contrast, local node encoding considers the relations between neighbor nodes capturing the graph structure, but it can fail to capture long-range relations. In this work, we gather both encoding strategies, proposing novel neural models that encode an input graph combining both global and local node contexts, in order to learn better contextualized node embeddings. In our experiments, we demonstrate that our approaches lead to significant improvements on two graph-to-text datasets achieving BLEU scores of 18.01 on the AGENDA dataset, and 63.69 on the WebNLG dataset for seen categories, outperforming state-of-the-art models by 3.7 and 3.1 points, respectively.

pdf bib
Interactive Text Ranking with Bayesian Optimization: A Case Study on Community QA and Summarization
Edwin Simpson | Yang Gao | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 8

For many NLP applications, such as question answering and summarization, the goal is to select the best solution from a large space of candidates to meet a particular user’s needs. To address the lack of user or task-specific training data, we propose an interactive text ranking approach that actively selects pairs of candidates, from which the user selects the best. Unlike previous strategies, which attempt to learn a ranking across the whole candidate space, our method uses Bayesian optimization to focus the user’s labeling effort on high quality candidates and integrate prior knowledge to cope better with small data scenarios. We apply our method to community question answering (cQA) and extractive multidocument summarization, finding that it significantly outperforms existing interactive approaches. We also show that the ranking function learned by our method is an effective reward function for reinforcement learning, which improves the state of the art for interactive summarization.

pdf bib
How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation
Steffen Eger | Johannes Daxenberger | Iryna Gurevych
Proceedings of the 24th Conference on Computational Natural Language Learning

Sentence encoders map sentences to real-valued vectors for use in downstream applications. To peek into these representations (e.g., to increase interpretability of their results), probing tasks have been designed which query them for linguistic knowledge. However, designing probing tasks for lesser-resourced languages is tricky, because these often lack large-scale annotated data or (high-quality) dependency parsers as a prerequisite of probing task design in English. To investigate how to probe sentence embeddings in such cases, we investigate sensitivity of probing task results to structural design choices, conducting the first such large-scale study. We show that design choices like size of the annotated probing dataset and type of classifier used for evaluation do (sometimes substantially) influence probing outcomes. We then probe embeddings in a multilingual setup with design choices that lie in a ‘stable region’, as we identify for English, and find that results on English do not transfer to other languages. Fairer and more comprehensive sentence-level probing evaluation should thus be carried out on multiple languages in the future.

pdf bib
PuzzLing Machines: A Challenge on Learning From Small Data
Gözde Gül Şahin | Yova Kementchedjhieva | Phillip Rust | Iryna Gurevych
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Deep neural models have repeatedly proved excellent at memorizing surface patterns from large datasets for various ML and NLP benchmarks. They struggle to achieve human-like thinking, however, because they lack the skill of iterative reasoning upon knowledge. To expose this problem in a new light, we introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students. These puzzles are carefully designed to contain only the minimal amount of parallel text necessary to deduce the form of unseen expressions. Solving them does not require external information (e.g., knowledge bases, visual signals) or linguistic expertise, but meta-linguistic awareness and deductive skills. Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages. We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected. We hope that this benchmark, available at https://ukplab.github.io/PuzzLing-Machines/, inspires further efforts towards a new paradigm in NLP—one that is grounded in human-like reasoning and understanding.

pdf bib
Dialogue Coherence Assessment Without Explicit Dialogue Act Labels
Mohsen Mesgar | Sebastian Bücker | Iryna Gurevych
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent dialogue coherence models use the coherence features designed for monologue texts, e.g., nominal entities, to represent utterances and then explicitly augment them with dialogue-relevant features, e.g., dialogue act labels. This has two drawbacks: (a) semantics of utterances are limited to entity mentions, and (b) the performance of coherence models strongly relies on the quality of the input dialogue act labels. We address these issues by introducing a novel approach to dialogue coherence assessment. We use dialogue act prediction as an auxiliary task in a multi-task learning scenario to obtain informative utterance representations for coherence assessment. Our approach alleviates the need for explicit dialogue act labels during evaluation. The results of our experiments show that our model substantially (more than 20 accuracy points) outperforms its strong competitors on the DailyDialogue corpus, and performs on par with them on the SwitchBoard corpus for ranking dialogues concerning their coherence. We release our source code.

pdf bib
Empowering Active Learning to Jointly Optimize System and User Demands
Ji-Ung Lee | Christian M. Meyer | Iryna Gurevych
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Existing approaches to active learning maximize the system performance by sampling unlabeled instances for annotation that yield the most efficient training. However, when active learning is integrated with an end-user application, this can lead to frustration for participating users, as they spend time labeling instances that they would not otherwise be interested in reading. In this paper, we propose a new active learning approach that jointly optimizes the seemingly counteracting objectives of the active learning system (training efficiently) and the user (receiving useful instances). We study our approach in an educational application, which particularly benefits from this technique as the system needs to rapidly learn to predict the appropriateness of an exercise to a particular user, while the users should receive only exercises that match their skills. We evaluate multiple learning strategies and user types with data from real users and find that our joint approach better satisfies both objectives when alternative methods lead to many unsuitable exercises for end users.

pdf bib
From Zero to Hero: Human-In-The-Loop Entity Linking in Low Resource Domains
Jan-Christoph Klie | Richard Eckart de Castilho | Iryna Gurevych
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Entity linking (EL) is concerned with disambiguating entity mentions in a text against knowledge bases (KB). It is crucial in a considerable number of fields like humanities, technical writing and biomedical sciences to enrich texts with semantics and discover more knowledge. The use of EL in such domains requires handling noisy texts, low resource settings and domain-specific KBs. Existing approaches are mostly inappropriate for this, as they depend on training data. However, in the above scenario, hardly any annotated data exists, and it needs to be created from scratch. We therefore present a novel domain-agnostic Human-In-The-Loop annotation approach: we use recommenders that suggest potential concepts and adaptive candidate ranking, thereby speeding up the overall annotation process and making it less tedious for users. We evaluate our ranking approach in a simulation on difficult texts and show that it greatly outperforms a strong baseline in ranking accuracy. In a user study, the annotation speed improves by 35% compared to annotating without interactive support; users report that they strongly prefer our system. An open-source and ready-to-use implementation based on the text annotation platform INCEpTION (https://inception-project.github.io) is made available.

pdf bib
Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance
Prasetya Ajie Utama | Nafise Sadat Moosavi | Iryna Gurevych
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Models for natural language understanding (NLU) tasks often rely on the idiosyncratic biases of the dataset, which make them brittle against test cases outside the training distribution. Recently, several proposed debiasing methods have been shown to be very effective in improving out-of-distribution performance. However, their improvements come at the expense of a performance drop when models are evaluated on the in-distribution data, which contain examples with higher diversity. This seemingly inevitable trade-off may not tell us much about the changes in the reasoning and understanding capabilities of the resulting models on broader types of examples beyond the small subset represented in the out-of-distribution data. In this paper, we address this trade-off by introducing a novel debiasing method, called confidence regularization, which discourages models from exploiting biases while enabling them to receive enough incentive to learn from all the training examples. We evaluate our method on three NLU tasks and show that, in contrast to its predecessors, it improves the performance on out-of-distribution datasets (e.g., 7pp gain on HANS dataset) while maintaining the original in-distribution accuracy.

2019

pdf bib
Classification and Clustering of Arguments with Contextualized Word Embeddings
Nils Reimers | Benjamin Schiller | Tilman Beck | Johannes Daxenberger | Christian Stab | Iryna Gurevych
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We experiment with two recent contextualized word embedding methods (ELMo and BERT) in the context of open-domain argument search. For the first time, we show how to leverage the power of contextualized word embeddings to classify and cluster topic-dependent arguments, achieving impressive results on both tasks and across multiple datasets. For argument classification, we improve the state-of-the-art for the UKP Sentential Argument Mining Corpus by 20.8 percentage points and for the IBM Debater - Evidence Sentences dataset by 7.4 percentage points. For the understudied task of argument clustering, we propose a pre-training step which improves by 7.8 percentage points over strong baselines on a novel dataset, and by 12.3 percentage points for the Argument Facet Similarity (AFS) Corpus.

pdf bib
Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference
Tobias Falke | Leonardo F. R. Ribeiro | Prasetya Ajie Utama | Ido Dagan | Iryna Gurevych
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While recent progress on abstractive summarization has led to remarkably fluent summaries, factual errors in generated summaries still severely limit their use in practice. In this paper, we evaluate summaries produced by state-of-the-art models via crowdsourcing and show that such errors occur frequently, in particular with more abstractive models. We study whether textual entailment predictions can be used to detect such errors and if they can be reduced by reranking alternative predicted summaries. That leads to an interesting downstream application for entailment models. In our experiments, we find that out-of-the-box entailment models trained on NLI datasets do not yet offer the desired performance for the downstream task and we therefore release our annotations as additional test data for future extrinsic evaluations of NLI.

pdf bib
Analysis of Automatic Annotation Suggestions for Hard Discourse-Level Tasks in Expert Domains
Claudia Schulz | Christian M. Meyer | Jan Kiesewetter | Michael Sailer | Elisabeth Bauer | Martin R. Fischer | Frank Fischer | Iryna Gurevych
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Many complex discourse-level tasks can aid domain experts in their work but require costly expert annotations for data creation. To speed up and ease annotations, we investigate the viability of automatically generated annotation suggestions for such tasks. As an example, we choose a task that is particularly hard for both humans and machines: the segmentation and classification of epistemic activities in diagnostic reasoning texts. We create and publish a new dataset covering two domains and carefully analyse the suggested annotations. We find that suggestions have positive effects on annotation speed and performance, while not introducing noteworthy biases. Envisioning suggestion models that improve with newly annotated texts, we contrast methods for continuous model adjustment and suggest the most effective setup for suggestions in future expert tasks.

pdf bib
Predicting Humorousness and Metaphor Novelty with Gaussian Process Preference Learning
Edwin Simpson | Erik-Lân Do Dinh | Tristan Miller | Iryna Gurevych
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The inability to quantify key aspects of creative language is a frequent obstacle to natural language understanding. To address this, we introduce novel tasks for evaluating the creativeness of language—namely, scoring and ranking text by humorousness and metaphor novelty. To sidestep the difficulty of assigning discrete labels or numeric scores, we learn from pairwise comparisons between texts. We introduce a Bayesian approach for predicting humorousness and metaphor novelty using Gaussian process preference learning (GPPL), which achieves a Spearman’s ρ of 0.56 against gold using word embeddings and linguistic features. Our experiments show that given sparse, crowdsourced annotation data, ranking using GPPL outperforms best–worst scaling. We release a new dataset for evaluating humour containing 28,210 pairwise comparisons of 4,030 texts, and make our software freely available.

pdf bib
Pitfalls in the Evaluation of Sentence Embeddings
Steffen Eger | Andreas Rücklé | Iryna Gurevych
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

Deep learning models continuously break new records across different NLP tasks. At the same time, their success exposes weaknesses of model evaluation. Here, we compile several key pitfalls of evaluation of sentence embeddings, a currently very popular NLP paradigm. These pitfalls include the comparison of embeddings of different sizes, normalization of embeddings, and the low (and diverging) correlations between transfer and probing tasks. Our motivation is to challenge the current evaluation of sentence embeddings and to provide an easy-to-access reference for future research. Based on our insights, we also recommend better practices for better future evaluations of sentence embeddings.

pdf bib
Revisiting the Binary Linearization Technique for Surface Realization
Yevgeniy Puzikov | Claire Gardent | Ido Dagan | Iryna Gurevych
Proceedings of the 12th International Conference on Natural Language Generation

End-to-end neural approaches have achieved state-of-the-art performance in many natural language processing (NLP) tasks. Yet, they often lack transparency of the underlying decision-making process, hindering error analysis and certain model improvements. In this work, we revisit the binary linearization approach to surface realization, which exhibits more interpretable behavior, but was falling short in terms of prediction accuracy. We show how enriching the training data to better capture word order constraints almost doubles the performance of the system. We further demonstrate that encoding both local and global prediction contexts yields another considerable performance boost. With the proposed modifications, the system which ranked low in the latest shared task on multilingual surface realization now achieves best results in five out of ten languages, while being on par with the state-of-the-art approaches in others.

pdf bib
Fast Concept Mention Grouping for Concept Map-based Multi-Document Summarization
Tobias Falke | Iryna Gurevych
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Concept map-based multi-document summarization has recently been proposed as a variant of the traditional summarization task with graph-structured summaries. As shown by previous work, the grouping of coreferent concept mentions across documents is a crucial subtask of it. However, while the current state-of-the-art method suggested a new grouping method that was shown to improve the summary quality, its use of pairwise comparisons leads to polynomial runtime complexity that prohibits the application to large document collections. In this paper, we propose two alternative grouping techniques based on locality sensitive hashing, approximate nearest neighbor search and a fast clustering algorithm. They exhibit linear and log-linear runtime complexity, making them much more scalable. We report experimental results that confirm the improved runtime behavior while also showing that the quality of the summary concept maps remains comparable.

pdf bib
Does My Rebuttal Matter? Insights from a Major NLP Conference
Yang Gao | Steffen Eger | Ilia Kuznetsov | Iryna Gurevych | Yusuke Miyao
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Peer review is a core element of the scientific process, particularly in conference-centered fields such as ML and NLP. However, only a few studies have evaluated its properties empirically. Aiming to fill this gap, we present a corpus that contains over 4k reviews and 1.2k author responses from ACL-2018. We quantitatively and qualitatively assess the corpus. This includes a pilot study on paper weaknesses given by reviewers and on quality of author responses. We then focus on the role of the rebuttal phase, and propose a novel task to predict after-rebuttal (i.e., final) scores from initial reviews and author responses. Although author responses do have a marginal (and statistically significant) influence on the final scores, especially for borderline papers, our results suggest that a reviewer’s final score is largely determined by her initial score and the distance to the other reviewers’ initial scores. In this context, we discuss the conformity bias inherent to peer reviewing, a bias that has largely been overlooked in previous research. We hope our analyses will help better assess the usefulness of the rebuttal phase in NLP conferences.

Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems
Steffen Eger | Gözde Gül Şahin | Andreas Rücklé | Ji-Ung Lee | Claudia Schulz | Mohsen Mesgar | Krishnkant Swarnkar | Edwin Simpson | Iryna Gurevych
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Visual modifications to text are often used to obfuscate offensive comments in social media (e.g., “!d10t”) or as a writing style (“1337” in “leet speak”), among other scenarios. We consider this as a new type of adversarial attack in NLP, a setting to which humans are very robust, as our experiments with both simple and more difficult visual perturbations demonstrate. We investigate the impact of visual adversarial attacks on current NLP systems on character-, word-, and sentence-level tasks, showing that both neural and non-neural models are, in contrast to humans, extremely sensitive to such attacks, suffering performance decreases of up to 82%. We then explore three shielding methods—visual character embeddings, adversarial training, and rule-based recovery—which substantially improve the robustness of the models. However, the shielding methods still fall behind performances achieved in non-attack scenarios, which demonstrates the difficulty of dealing with visual attacks.

A Streamlined Method for Sourcing Discourse-level Argumentation Annotations from the Crowd
Tristan Miller | Maria Sukhareva | Iryna Gurevych
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The study of argumentation and the development of argument mining tools depend on the availability of annotated data, which is challenging to obtain in sufficient quantity and quality. We present a method that breaks down a popular but relatively complex discourse-level argument annotation scheme into a simpler, iterative procedure that can be applied even by untrained annotators. We apply this method in a crowdsourcing setup and report on the reliability of the annotations obtained. The source code for a tool implementing our annotation method, as well as the sample data we obtained (4909 gold-standard annotations across 982 documents), are freely released to the research community. These are intended to serve the needs of qualitative research into argumentation, as well as of data-driven approaches to argument mining.

A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking
Andreas Hanselowski | Christian Stab | Claudia Schulz | Zile Li | Iryna Gurevych
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Automated fact-checking based on machine learning is a promising approach to identify false information distributed on the web. In order to achieve satisfactory performance, machine learning methods require a large corpus with reliable annotations for the different tasks in the fact-checking process. Having analyzed existing fact-checking corpora, we found that none of them meets these criteria in full. They are either too small in size, do not provide detailed annotations, or are limited to a single domain. Motivated by this gap, we present a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation. To aid future corpus construction, we describe our methodology for corpus creation and annotation, and demonstrate that it results in substantial inter-annotator agreement. As baselines for future research, we perform experiments on our corpus with a number of model architectures that reach high performance in similar problem settings. Finally, to support the development of future models, we provide a detailed error analysis for each of the tasks. Our results show that the realistic, multi-domain setting defined by our data poses new challenges for the existing models, providing opportunities for considerable improvement by future systems.

A Bayesian Approach for Sequence Tagging with Crowds
Edwin Simpson | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Current methods for sequence tagging, a core task in NLP, are data hungry, which motivates the use of crowdsourcing as a cheap way to obtain labelled data. However, annotators are often unreliable and current aggregation methods cannot capture common types of span annotation error. To address this, we propose a Bayesian method for aggregating sequence tags that reduces errors by modelling sequential dependencies between the annotations as well as the ground-truth labels. By taking a Bayesian approach, we account for uncertainty in the model due to both annotator errors and the lack of data for modelling annotators who complete few tasks. We evaluate our model on crowdsourced data for named entity recognition, information extraction and argument mining, showing that our sequential model outperforms the previous state of the art, and that Bayesian approaches outperform non-Bayesian alternatives. We also find that our approach can reduce crowdsourcing costs through more effective active learning, as it better captures uncertainty in the sequence labels when there are few annotations.

Neural Duplicate Question Detection without Labeled Training Data
Andreas Rücklé | Nafise Sadat Moosavi | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Supervised training of neural models for duplicate question detection in community Question Answering (CQA) requires large amounts of labeled question pairs, which can be costly to obtain. To minimize this cost, recent works have often used alternative methods, e.g., adversarial domain adaptation. In this work, we propose two novel methods (weak supervision using the title and body of a question, and the automatic generation of duplicate questions) and show that both can achieve improved performance even though they do not require any labeled data. We provide a comparison of popular training strategies and show that our proposed approaches are more effective in many cases because they can utilize larger amounts of data from the CQA forums. Finally, we show that weak supervision with question title and body information is also an effective method to train CQA answer selection models without direct answer supervision.

Better Rewards Yield Better Summaries: Learning to Summarise Without References
Florian Böhm | Yang Gao | Christian M. Meyer | Ori Shapira | Ido Dagan | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Reinforcement Learning (RL) based document summarisation systems yield state-of-the-art performance in terms of ROUGE scores, because they directly use ROUGE as the rewards during training. However, summaries with high ROUGE scores often receive low human judgement. To find a better reward function that can guide RL to generate human-appealing summaries, we learn a reward function from human ratings on 2,500 summaries. Our reward function only takes the document and system summary as input. Hence, once trained, it can be used to train RL-based summarisation systems without using any reference summaries. We show that our learned rewards have significantly higher correlation with human ratings than previous approaches. Human evaluation experiments show that, compared to the state-of-the-art supervised-learning systems and ROUGE-as-rewards RL summarisation systems, the RL systems using our learned rewards during training generate summaries with higher human ratings. The learned reward function and our source code are available at https://github.com/yg211/summary-reward-no-reference.

Enhancing AMR-to-Text Generation with Dual Graph Representations
Leonardo F. R. Ribeiro | Claire Gardent | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating text from graph-based data, such as Abstract Meaning Representation (AMR), is a challenging task due to the inherent difficulty in how to properly encode the structure of a graph with labeled edges. To address this difficulty, we propose a novel graph-to-sequence model that encodes different but complementary perspectives of the structural information contained in the AMR graph. The model learns parallel top-down and bottom-up representations of nodes capturing contrasting views of the graph. We also investigate the use of different node message passing strategies, employing different state-of-the-art graph encoders to compute node representations based on incoming and outgoing perspectives. In our experiments, we demonstrate that the dual graph representation leads to improvements in AMR-to-text generation, achieving state-of-the-art results on two AMR datasets.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, they require that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embedding methods.
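
The "about 50 million" figure in the abstract above follows from counting unordered sentence pairs: 10,000 × 9,999 / 2 = 49,995,000, each of which needs its own forward pass when both sentences must be fed into the network together. With fixed sentence embeddings, each sentence is encoded once and pairs are compared with cosine similarity. The following is a minimal sketch of that idea only; the toy vectors and the helper names `pair_count`, `cosine`, and `most_similar_pair` are illustrative assumptions, and no BERT or SBERT model is involved.

```python
import math


def pair_count(n: int) -> int:
    """Number of unordered pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings):
    """Return (i, j, score) for the pair with the highest cosine similarity.

    The embeddings are assumed to be precomputed, one vector per sentence,
    so the loop only performs cheap vector comparisons.
    """
    best_score, best_i, best_j = -2.0, None, None
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            score = cosine(embeddings[i], embeddings[j])
            if score > best_score:
                best_score, best_i, best_j = score, i, j
    return best_i, best_j, best_score


# Hypothetical 2-dimensional "sentence embeddings" for illustration only.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

print(pair_count(10_000))
print(most_similar_pair(toy_embeddings))
```

The comparison loop is still quadratic in the number of sentences, but each step is a vector operation rather than a full network inference, which is where the reported speed-up comes from.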

FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning
Jonas Pfeiffer | Christian M. Meyer | Claudia Schulz | Jan Kiesewetter | Jan Zottmann | Michael Sailer | Elisabeth Bauer | Frank Fischer | Martin R. Fischer | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

Our proposed system FAMULUS helps students learn to diagnose based on automatic feedback in virtual patient simulations, and it supports instructors in labeling training data. Diagnosing is an exceptionally difficult skill to obtain but vital for many different professions (e.g., medical doctors, teachers). Previous case simulation systems are limited to multiple-choice questions and thus cannot give constructive individualized feedback on a student’s diagnostic reasoning process. Given initially only limited data, we leverage a (replaceable) NLP model to both support experts in their further data annotation with automatic suggestions, and we provide automatic feedback for students. We argue that because the central model consistently improves, our interactive approach encourages both students and instructors to recurrently use the tool, and thus accelerate the speed of data creation and annotation. We show results from two user studies on diagnostic reasoning in medicine and teacher education and outline how our system can be extended to further use cases.

LINSPECTOR WEB: A Multilingual Probing Suite for Word Representations
Max Eichler | Gözde Gül Şahin | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

We present LINSPECTOR WEB, an open source multilingual inspector to analyze word representations. Our system provides researchers working in low-resource settings with an easily accessible web-based probing tool to gain quick insights into their word embeddings, especially outside of the English language. To do this, we employ 16 simple linguistic probing tasks such as gender, case marking, and tense for a diverse set of 28 languages. We support probing of static word embeddings along with pretrained AllenNLP models that are commonly used for NLP downstream tasks such as named entity recognition, natural language inference and dependency parsing. The results are visualized in a polar chart and also provided as a table. LINSPECTOR WEB is available as an offline tool or at https://linspector.ukp.informatik.tu-darmstadt.de.

2018

Weeding out Conventionalized Metaphors: A Corpus of Novel Metaphor Annotations
Erik-Lân Do Dinh | Hannah Wieland | Iryna Gurevych
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We encounter metaphors every day, but only a few jump out at us and make us stumble. However, little effort has been devoted to investigating more novel metaphors in comparison to general metaphor detection efforts. We attribute this gap primarily to the lack of larger datasets that distinguish between conventionalized, i.e., very common, and novel metaphors. The goal of this paper is to alleviate this situation by introducing a crowdsourced novel metaphor annotation layer for an existing metaphor corpus. Further, we analyze our corpus and investigate correlations between novelty and features that are typically used in metaphor detection, such as concreteness ratings and more semantic features like the Potential for Metaphoricity. Finally, we present a baseline approach to assess novelty in metaphors based on our annotations.

Cross-topic Argument Mining from Heterogeneous Sources
Christian Stab | Tristan Miller | Benjamin Schiller | Pranav Rai | Iryna Gurevych
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Argument mining is a core technology for automating argument search in large document collections. Despite its usefulness for this task, most current approaches are designed for use only with specific text types and fall short when applied to heterogeneous texts. In this paper, we propose a new sentential annotation scheme that is reliably applicable by crowd workers to arbitrary Web texts. We source annotations for over 25,000 instances covering eight controversial topics. We show that integrating topic information into bidirectional long short-term memory networks outperforms vanilla BiLSTMs by more than 3 percentage points in F1 in two- and three-label cross-topic settings. We also show that these results can be further improved by leveraging additional data for topic relevance using multi-task learning.

APRIL: Interactively Learning to Summarise by Combining Active Preference Learning and Reinforcement Learning
Yang Gao | Christian M. Meyer | Iryna Gurevych
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose a method to perform automatic document summarisation without using reference summaries. Instead, our method interactively learns from users’ preferences. The merit of preference-based interactive summarisation is that preferences are easier for users to provide than reference summaries. Existing preference-based interactive learning methods suffer from high sample complexity, i.e. they need to interact with the oracle for many rounds in order to converge. In this work, we propose a new objective function, which enables us to leverage active learning, preference learning and reinforcement learning techniques in order to reduce the sample complexity. Both simulation and real-user experiments suggest that our method significantly advances the state of the art. Our source code is freely available at https://github.com/UKPLab/emnlp2018-april.

Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks
Steffen Eger | Paul Youssef | Iryna Gurevych
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Activation functions play a crucial role in neural networks because they are the nonlinearities which have been attributed to the success story of deep learning. One of the currently most popular activation functions is ReLU, but several competitors have recently been proposed or ‘discovered’, including LReLU functions and swish. While most works compare newly proposed activation functions on few tasks (usually from image classification) and against few competitors (usually ReLU), we perform the first large-scale comparison of 21 activation functions across eight different NLP tasks. We find that a largely unknown activation function performs most stably across all tasks, the so-called penalized tanh function. We also show that it can successfully replace the sigmoid and tanh gates in LSTM cells, leading to a 2 percentage point (pp) improvement over the standard choices on a challenging NLP task.
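
The abstract above names the penalized tanh function without defining it. The sketch below uses the commonly cited form, which behaves like tanh for positive inputs and scales the negative branch by a small factor. The factor 0.25 and the parameter name `alpha` are assumptions taken from that common definition, not from this abstract.

```python
import math


def penalized_tanh(x: float, alpha: float = 0.25) -> float:
    """Penalized tanh: tanh(x) for x > 0, alpha * tanh(x) otherwise.

    alpha = 0.25 is an assumed default (the commonly cited value);
    the abstract itself does not state it.
    """
    t = math.tanh(x)
    return t if x > 0 else alpha * t


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(value, penalized_tanh(value))
```

Unlike ReLU, the function saturates on both sides, and unlike plain tanh, it is not symmetric around zero because negative activations are damped.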

Interactive Instance-based Evaluation of Knowledge Base Question Answering
Daniil Sorokin | Iryna Gurevych
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Most approaches to Knowledge Base Question Answering are based on semantic parsing. In this paper, we present a tool that aids in debugging of question answering systems that construct a structured semantic representation for the input question. Previous work has largely focused on building question answering interfaces or evaluation frameworks that unify multiple data sets. The primary objective of our system is to enable interactive debugging of model predictions on individual instances (questions) and to simplify manual error analysis. Our interactive interface helps researchers to understand the shortcomings of a particular model, qualitatively analyze the complete pipeline and compare different models. A set of sit-by sessions was used to validate our interface design.

Integrating Knowledge-Supported Search into the INCEpTION Annotation Platform
Beto Boullosa | Richard Eckart de Castilho | Naveen Kumar | Jan-Christoph Klie | Iryna Gurevych
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Annotating entity mentions and linking them to a knowledge resource are essential tasks in many domains. It disambiguates mentions, introduces cross-document coreferences, and the resources contribute extra information, e.g. taxonomic relations. Such tasks benefit from text annotation tools that integrate a search which covers the text, the annotations, as well as the knowledge resource. However, to the best of our knowledge, no current tools integrate knowledge-supported search as well as entity linking support. We address this gap by introducing knowledge-supported search functionality into the INCEpTION text annotation platform. In our approach, cross-document references are created by linking entity mentions to a knowledge base in the form of a structured hierarchical vocabulary. The resulting annotations are then indexed to enable fast and yet complex queries taking into account the text, the annotations, and the vocabulary structure.

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Iryna Gurevych | Yusuke Miyao
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Iryna Gurevych | Yusuke Miyao
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

From Text to Lexicon: Bridging the Gap between Word Embeddings and Lexical Resources
Ilia Kuznetsov | Iryna Gurevych
Proceedings of the 27th International Conference on Computational Linguistics

Distributional word representations (often referred to as word embeddings) are omnipresent in modern NLP. Early work has focused on building representations for word types, and recent studies show that lemmatization and part of speech (POS) disambiguation of targets in isolation improve the performance of word embeddings on a range of downstream tasks. However, the reasons behind these improvements, the qualitative effects of these operations and the combined performance of lemmatized and POS disambiguated targets are less studied. This work aims to close this gap and puts previous findings into a general perspective. We examine the effect of lemmatization and POS typing on word embedding performance in a novel resource-based evaluation scenario, as well as on standard similarity benchmarks. We show that these two operations have complementary qualitative and vocabulary-level effects and are best used in combination. We find that the improvement is more pronounced for verbs and show how lemmatization and POS typing implicitly target some of the verb-specific issues. We claim that the observed improvement is a result of better conceptual alignment between word embeddings and lexical resources, stressing the need for conceptually plausible modeling of word embedding targets.

Cross-lingual Argumentation Mining: Machine Translation (and a bit of Projection) is All You Need!
Steffen Eger | Johannes Daxenberger | Christian Stab | Iryna Gurevych
Proceedings of the 27th International Conference on Computational Linguistics

Argumentation mining (AM) requires the identification of complex discourse structures and has lately been applied with success monolingually. In this work, we show that the existing resources are, however, not adequate for assessing cross-lingual AM, due to their heterogeneity or lack of complexity. We therefore create suitable parallel corpora by (human and machine) translating a popular AM dataset consisting of persuasive student essays into German, French, Spanish, and Chinese. We then compare (i) annotation projection and (ii) bilingual word embeddings based direct transfer strategies for cross-lingual AM, finding that the former performs considerably better and almost eliminates the loss from cross-lingual transfer. Moreover, we find that annotation projection works equally well when using either costly human or cheap machine translations. Our code and data are available at http://github.com/UKPLab/coling2018-xling_argument_mining.

Killing Four Birds with Two Stones: Multi-Task Learning for Non-Literal Language Detection
Erik-Lân Do Dinh | Steffen Eger | Iryna Gurevych
Proceedings of the 27th International Conference on Computational Linguistics

Non-literal language phenomena such as idioms or metaphors are commonly studied in isolation from each other in NLP. However, often similar definitions and features are being used for different phenomena, challenging the distinction. Instead, we propose to view the detection problem as a generalized non-literal language classification problem. In this paper we investigate multi-task learning for related non-literal language phenomena. We show that in contrast to simply joining the data of multiple tasks, multi-task learning consistently improves upon four metaphor and idiom detection tasks in two languages, English and German. Comparing two state-of-the-art multi-task learning architectures, we also investigate when soft parameter sharing and learned information flow can be beneficial for our related tasks. We make our adapted code publicly available.

A Retrospective Analysis of the Fake News Challenge Stance-Detection Task
Andreas Hanselowski | Avinesh PVS | Benjamin Schiller | Felix Caspelherr | Debanjan Chaudhuri | Christian M. Meyer | Iryna Gurevych
Proceedings of the 27th International Conference on Computational Linguistics

The 2017 Fake News Challenge Stage 1 (FNC-1) shared task addressed a stance classification task as a crucial first step towards detecting fake news. To date, there is no in-depth analysis paper to critically discuss FNC-1’s experimental setup, reproduce the results, and draw conclusions for next-generation stance classification methods. In this paper, we provide such an in-depth analysis for the three top-performing systems. We first find that FNC-1’s proposed evaluation metric favors the majority class, which can be easily classified, and thus overestimates the true discriminative power of the methods. Therefore, we propose a new F1-based metric yielding a changed system ranking. Next, we compare the features and architectures used, which leads to a novel feature-rich stacked LSTM model that performs on par with the best systems, but is superior in predicting minority classes. To understand the methods’ ability to generalize, we derive a new dataset and perform both in-domain and cross-domain experiments. Our qualitative and quantitative study helps interpret the original FNC-1 scores and understand which features help improve performance and why. Our new dataset and all source code used during the reproduction study are publicly available for future research.
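
The point about a metric favoring the majority class can be illustrated with a generic comparison between accuracy and a class-averaged (macro) F1 score. This is a hedged sketch only: it uses plain accuracy rather than FNC-1's official weighted score, the macro F1 shown here is not necessarily the exact metric the paper proposes, and the ten toy labels are invented for the example.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Invented, heavily imbalanced toy data: nine majority-class instances
# and a single minority-class instance.
gold = ["unrelated"] * 9 + ["agree"]
# A trivial system that always predicts the majority class.
pred = ["unrelated"] * 10

print("accuracy:", accuracy(gold, pred))
print("macro F1:", macro_f1(gold, pred))
```

On this toy data the always-majority system reaches high accuracy while its macro F1 stays below 0.5, because the minority class contributes an F1 of zero with equal weight.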

Multimodal Grounding for Language Processing
Lisa Beinborn | Teresa Botschen | Iryna Gurevych
Proceedings of the 27th International Conference on Computational Linguistics

This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefit of multimodal grounding for a variety of language processing tasks and the challenges that arise. We particularly focus on multimodal grounding of verbs which play a crucial role for the compositional power of language.

Modeling Semantics with Gated Graph Neural Networks for Knowledge Base Question Answering
Daniil Sorokin | Iryna Gurevych
Proceedings of the 27th International Conference on Computational Linguistics

Most approaches to Knowledge Base Question Answering are based on semantic parsing. In this paper, we address the problem of learning vector representations for complex semantic parses that consist of multiple entities and relations. Previous work largely focused on selecting the correct semantic relations for a question and disregarded the structure of the semantic parse: the connections between entities and the directions of the relations. We propose to use Gated Graph Neural Networks to encode the graph structure of the semantic parse. We show on two data sets that the graph networks outperform all baseline models that do not explicitly model the structure. The error analysis confirms that our approach can successfully process complex semantic parses.

The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation
Jan-Christoph Klie | Michael Bugert | Beto Boullosa | Richard Eckart de Castilho | Iryna Gurevych
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

We introduce INCEpTION, a new annotation platform for tasks including interactive and semantic annotation (e.g., concept linking, fact linking, knowledge base population, semantic frame annotation). These tasks are very time consuming and demanding for annotators, especially when knowledge bases are used. We address these issues by developing an annotation platform that incorporates machine learning capabilities which actively assist and guide annotators. The platform is both generic and modular. It targets a range of research domains in need of semantic annotation, such as digital humanities, bioinformatics, or linguistics. INCEpTION is publicly available as open-source software.

Lexical-semantic resources: yet powerful resources for automatic personality classification
Xuan-Son Vu | Lucie Flekova | Lili Jiang | Iryna Gurevych
Proceedings of the 9th Global Wordnet Conference

In this paper, we aim to reveal the impact of lexical-semantic resources, used in particular for word sense disambiguation and sense-level semantic categorization, on the automatic personality classification task. While stylistic features (e.g., part-of-speech counts) have shown their power in this task, the impact of semantics beyond targeted word lists is relatively unexplored. We propose and extract three types of lexical-semantic features, which capture high-level concepts and emotions, overcoming the lexical gap of word n-grams. Our experimental results are comparable to state-of-the-art methods, while no personality-specific resources are required.

SemEval-2018 Task 12: The Argument Reasoning Comprehension Task
Ivan Habernal | Henning Wachsmuth | Iryna Gurevych | Benno Stein
Proceedings of The 12th International Workshop on Semantic Evaluation

A natural language argument is composed of a claim as well as reasons given as premises for the claim. The warrant explaining the reasoning is usually left implicit, as it is clear from the context and common sense. This makes the comprehension of arguments easy for humans but hard for machines. This paper summarizes the first shared task on argument reasoning comprehension. Given a premise and a claim along with some topic information, the goal was to automatically identify the correct warrant among two candidates that are plausible and lexically close, but in fact imply opposite claims. We describe the dataset with 1970 instances that we built for the task, and we outline the 21 computational approaches that participated, most of which used neural networks. The results reveal the complexity of the task, with many approaches hardly improving over the random accuracy of about 0.5. Still, the best observed accuracy (0.712) underlines the feasibility in principle of identifying warrants. Our analysis indicates that an inclusion of external knowledge is key to reasoning comprehension.

Mixing Context Granularities for Improved Entity Linking on Question Answering Data across Entity Categories
Daniil Sorokin | Iryna Gurevych
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

The first stage of every knowledge base question answering approach is to link entities in the input question. We investigate entity linking in the context of a question answering task and present a jointly optimized neural architecture for entity mention detection and entity disambiguation that models the surrounding context on different levels of granularity. We use the Wikidata knowledge base and available question answering datasets to create benchmarks for entity linking on question answering data. Our approach outperforms the previous state-of-the-art system on this data, resulting in an average 8% improvement of the final score. We further demonstrate that our model delivers strong performance across different entity categories.

A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning
Hatem Mousselly-Sergieh | Teresa Botschen | Iryna Gurevych | Stefan Roth
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

Current methods for knowledge graph (KG) representation learning focus solely on the structure of the KG and do not exploit any kind of external information, such as visual and linguistic information corresponding to the KG entities. In this paper, we propose a multimodal translation-based approach that defines the energy of a KG triple as the sum of sub-energy functions that leverage both multimodal (visual and linguistic) and structural KG representations. Next, a ranking-based loss is minimized using a simple neural network architecture. Moreover, we introduce a new large-scale dataset for multimodal KG representation learning. We compare the performance of our approach to other baselines on two standard tasks, namely knowledge graph completion and triple classification, using our dataset as well as the WN9-IMG dataset. The results demonstrate that our approach outperforms all baselines on both tasks and datasets.

Before Name-Calling: Dynamics and Triggers of Ad Hominem Fallacies in Web Argumentation
Ivan Habernal | Henning Wachsmuth | Iryna Gurevych | Benno Stein
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Arguing without committing a fallacy is one of the main requirements of an ideal debate. But even when debating rules are strictly enforced and fallacious arguments punished, arguers often lapse into attacking the opponent by an ad hominem argument. As existing research lacks solid empirical investigation of the typology of ad hominem arguments as well as their potential causes, this paper fills this gap by (1) performing several large-scale annotation studies, (2) experimenting with various neural architectures and validating our working hypotheses, such as controversy or reasonableness, and (3) providing linguistic insights into triggers of ad hominem using explainable neural network architectures.

Multimodal Frame Identification with Multilingual Evaluation
Teresa Botschen | Iryna Gurevych | Jan-Christoph Klie | Hatem Mousselly-Sergieh | Stefan Roth
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

An essential step in FrameNet Semantic Role Labeling is the Frame Identification (FrameId) task, which aims at disambiguating a situation around a predicate. Whilst current FrameId methods rely on textual representations only, we hypothesize that FrameId can profit from a richer understanding of the situational context. Such contextual information can be obtained from common sense knowledge, which is more present in images than in text. In this paper, we extend a state-of-the-art FrameId system in order to effectively leverage multimodal representations. We conduct a comprehensive evaluation on the English FrameNet and its German counterpart SALSA. Our analysis shows that for the German data, textual representations are still competitive with multimodal ones. However, on the English data, our multimodal FrameId approach outperforms its unimodal counterpart, setting a new state of the art. Its benefits are particularly apparent in dealing with ambiguous and rare instances, the main source of errors of current systems. For research purposes, we release (a) the implementation of our system, (b) our evaluation splits for SALSA 2.0, and (c) the embeddings for synsets and IMAGINED words.

The Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants
Ivan Habernal | Henning Wachsmuth | Iryna Gurevych | Benno Stein
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Reasoning is a crucial part of natural language argumentation. To comprehend an argument, one must analyze its warrant, which explains why its claim follows from its premises. As arguments are highly contextualized, warrants are usually presupposed and left implicit. Thus, the comprehension does not only require language understanding and logic skills, but also depends on common sense. In this paper we develop a methodology for reconstructing warrants systematically. We operationalize it in a scalable crowdsourcing process, resulting in a freely licensed dataset with warrants for 2k authentic arguments from news comments. On this basis, we present a new challenging task, the argument reasoning comprehension task. Given an argument with a claim and a premise, the goal is to choose the correct implicit warrant from two options. Both warrants are plausible and lexically close, but lead to contradicting claims. A solution to this task will define a substantial step towards automatic warrant reconstruction. However, experiments with several neural attention and language models reveal that current approaches do not suffice.

Multi-Task Learning for Argumentation Mining in Low-Resource Settings
Claudia Schulz | Steffen Eger | Johannes Daxenberger | Tobias Kahse | Iryna Gurevych
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We investigate whether and where multi-task learning (MTL) can improve performance on NLP problems related to argumentation mining (AM), in particular argument component identification. Our results show that MTL performs particularly well (and better than single-task learning) when little training data is available for the main task, a common scenario in AM. Our findings challenge previous assumptions that conceptualizations across AM datasets are divergent and that MTL is difficult for semantic or higher-level tasks.

Objective Function Learning to Match Human Judgements for Optimization-Based Summarization
Maxime Peyrard | Iryna Gurevych
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Supervised summarization systems usually rely on supervision at the sentence or n-gram level provided by automatic metrics like ROUGE, which act as noisy proxies for human judgments. In this work, we learn a summary-level scoring function 𝜃 including human judgments as supervision and automatically generated data as regularization. We extract summaries with a genetic algorithm using 𝜃 as a fitness function. We observe strong and promising performances across datasets in both automatic and manual evaluation.

ArgumenText: Searching for Arguments in Heterogeneous Sources
Christian Stab | Johannes Daxenberger | Chris Stahlhut | Tristan Miller | Benjamin Schiller | Christopher Tauchmann | Steffen Eger | Iryna Gurevych
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

Argument mining is a core technology for enabling argument search in large corpora. However, most current approaches fall short when applied to heterogeneous texts. In this paper, we present an argument retrieval system capable of retrieving sentential arguments for any given controversial topic. By analyzing the highest-ranked results extracted from Web sources, we found that our system covers 89% of arguments found in expert-curated lists of arguments from an online debate portal, and also identifies additional valid arguments.

BinLin: A Simple Method of Dependency Tree Linearization
Yevgeniy Puzikov | Iryna Gurevych
Proceedings of the First Workshop on Multilingual Surface Realisation

Surface Realization Shared Task 2018 is a workshop on generating sentences from lemmatized sets of dependency triples. This paper describes the results of our participation in the challenge. We develop a data-driven pipeline system which first orders the lemmas and then conjugates the words to finish the surface realization process. Our contribution is a novel sequential method of ordering lemmas, which, despite its simplicity, achieves promising results. We demonstrate the effectiveness of the proposed approach, describe its limitations and outline ways to improve it.

One Size Fits All? A simple LSTM for non-literal token and construction-level classification
Erik-Lân Do Dinh | Steffen Eger | Iryna Gurevych
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we tackle four different tasks of non-literal language classification: token and construction level metaphor detection, classification of idiomatic use of infinitive-verb compounds, and classification of non-literal particle verbs. One of the tasks operates on the token level, while the three other tasks classify constructions such as “hot topic” or “stehen lassen” (“to allow sth. to stand” vs. “to abandon so.”). The two metaphor detection tasks are in English, while the two non-literal language detection tasks are in German. We propose a simple context-encoding LSTM model and show that it outperforms the state-of-the-art on two tasks. Additionally, we experiment with different embeddings for the token level metaphor detection task and find that 1) their performance varies according to the genre, and 2) word2vec embeddings perform best on 3 out of 4 genres, despite being one of the simplest tested models. In summary, we present a large-scale analysis of a neural model for non-literal language classification (i) at different granularities, (ii) in different languages, (iii) over different non-literal language phenomena.

Frame- and Entity-Based Knowledge for Common-Sense Argumentative Reasoning
Teresa Botschen | Daniil Sorokin | Iryna Gurevych
Proceedings of the 5th Workshop on Argument Mining

Common-sense argumentative reasoning is a challenging task that requires holistic understanding of the argumentation where external knowledge about the world is hypothesized to play a key role. We explore the idea of using event knowledge about prototypical situations from FrameNet and fact knowledge about concrete entities from Wikidata to solve the task. We find that both resources can contribute to an improvement over the non-enriched approach and point out two persisting challenges: first, integration of many annotations of the same type, and second, fusion of complementary annotations. After our explorations, we question the key role of external world knowledge with respect to the argumentative reasoning task and rather point towards a logic-based analysis of the chain of reasoning.

PD3: Better Low-Resource Cross-Lingual Transfer By Combining Direct Transfer and Annotation Projection
Steffen Eger | Andreas Rücklé | Iryna Gurevych
Proceedings of the 5th Workshop on Argument Mining

We consider unsupervised cross-lingual transfer on two tasks, viz., sentence-level argumentation mining and standard POS tagging. We combine direct transfer using bilingual embeddings with annotation projection, which projects labels across unlabeled parallel data. We do so by either merging respective source and target language datasets or alternatively by using multi-task learning. Our combination strategy considerably improves upon both direct transfer and projection with few available parallel sentences, the most realistic scenario for many low-resource target languages.

Cross-Lingual Argumentative Relation Identification: from English to Portuguese
Gil Rocha | Christian Stab | Henrique Lopes Cardoso | Iryna Gurevych
Proceedings of the 5th Workshop on Argument Mining

Argument mining aims to detect and identify argument structures from textual resources. In this paper, we aim to address the task of argumentative relation identification, a subtask of argument mining, for which several approaches have been recently proposed in a monolingual setting. To overcome the lack of annotated resources in less-resourced languages, we present the first attempt to address this subtask in a cross-lingual setting. We compare two standard strategies for cross-language learning, namely: projection and direct-transfer. Experimental results show that by using unsupervised language adaptation the proposed approaches perform at a competitive level when compared with fully-supervised in-language learning settings.

UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification
Andreas Hanselowski | Hao Zhang | Zile Li | Daniil Sorokin | Benjamin Schiller | Claudia Schulz | Iryna Gurevych
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

The Fact Extraction and VERification (FEVER) shared task was launched to support the development of systems able to verify claims by extracting supporting or refuting facts from raw text. The shared task organizers provide a large-scale dataset for the consecutive steps involved in claim verification, in particular, document retrieval, fact extraction, and claim classification. In this paper, we present our claim verification pipeline approach, which, according to the preliminary results, scored third in the shared task, out of 23 competing systems. For the document retrieval, we implemented a new entity linking approach. In order to be able to rank candidate facts and classify a claim on the basis of several selected facts, we introduce two extensions to the Enhanced LSTM (ESIM).

E2E NLG Challenge: Neural Models vs. Templates
Yevgeniy Puzikov | Iryna Gurevych
Proceedings of the 11th International Conference on Natural Language Generation

E2E NLG Challenge is a shared task on generating restaurant descriptions from sets of key-value pairs. This paper describes the results of our participation in the challenge. We develop a simple, yet effective neural encoder-decoder model which produces fluent restaurant descriptions and outperforms a strong baseline. We further analyze the data provided by the organizers and conclude that the task can also be approached with a template-based model developed in just a few hours.

Corpus-Driven Thematic Hierarchy Induction
Ilia Kuznetsov | Iryna Gurevych
Proceedings of the 22nd Conference on Computational Natural Language Learning

Thematic role hierarchy is a widely used linguistic tool to describe interactions between semantic roles and their syntactic realizations. Despite decades of dedicated research and numerous thematic hierarchy suggestions in the literature, this concept has not been used in NLP so far due to incompatibility and limited scope of existing hierarchies. We introduce an empirical framework for thematic hierarchy induction and evaluate several role ranking strategies on English and German full-text corpus data. We hypothesize that global thematic hierarchy induction is feasible, that a hierarchy can be induced from just fractions of training data and that resulting hierarchies apply cross-lingually. We evaluate these assumptions empirically.

Event Time Extraction with a Decision Tree of Neural Classifiers
Nils Reimers | Nazanin Dehghani | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 6

Extracting from text the information about when an event happened is challenging. Documents do not only report on current events, but also on past events as well as on future events. Often, the relevant time information for an event is scattered across the document. In this paper we present a novel method to automatically anchor events in time. To our knowledge it is the first approach that takes temporal information from the complete document into account. We created a decision tree that applies neural network based classifiers at its nodes. We use this tree to incrementally infer, in a stepwise manner, at which time frame an event happened. We evaluate the approach on the TimeBank-EventTime Corpus (Reimers et al., 2016) achieving an accuracy of 42.0% compared to an inter-annotator agreement (IAA) of 56.7%. For events that span over a single day we observe an accuracy improvement of 33.1 points compared to the state-of-the-art CAEVO system (Chambers et al., 2014). Without retraining, we apply this model to the SemEval-2015 Task 4 on automatic timeline generation and achieve an improvement of 4.01 points F1-score compared to the state-of-the-art. Our code is publicly available.

Finding Convincing Arguments Using Scalable Bayesian Preference Learning
Edwin Simpson | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 6

We introduce a scalable Bayesian preference learning method for identifying convincing arguments in the absence of gold-standard ratings or rankings. In contrast to previous work, we avoid the need for separate methods to perform quality control on training data, predict rankings and perform pairwise classification. Bayesian approaches are an effective solution when faced with sparse or noisy training data, but have not previously been used to identify convincing arguments. One issue is scalability, which we address by developing a stochastic variational inference method for Gaussian process (GP) preference learning. We show how our method can be applied to predict argument convincingness from crowdsourced data, outperforming the previous state-of-the-art, particularly when trained with small amounts of unreliable data. We demonstrate how the Bayesian approach enables more effective active learning, thereby reducing the amount of data required to identify convincing arguments for new users and domains. While word embeddings are principally used with neural networks, our results show that word embeddings in combination with linguistic features also benefit GPs when predicting argument convincingness.

A Legal Perspective on Training Models for Natural Language Processing
Richard Eckart de Castilho | Giulia Dore | Thomas Margoni | Penny Labropoulou | Iryna Gurevych
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Adapting Serious Game for Fallacious Argumentation to German: Pitfalls, Insights, and Best Practices
Ivan Habernal | Patrick Pauli | Iryna Gurevych
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Argumentation Mining in User-Generated Web Discourse
Ivan Habernal | Iryna Gurevych
Computational Linguistics, Volume 43, Issue 1 - April 2017

The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people’s argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges given by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source code, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task.

Parsing Argumentation Structures in Persuasive Essays
Christian Stab | Iryna Gurevych
Computational Linguistics, Volume 43, Issue 3 - September 2017

In this article, we present a novel approach for parsing argumentation structures. We identify argument components using sequence labeling at the token level and apply a new joint model for detecting argumentation structures. The proposed model globally optimizes argument component types and argumentative relations using Integer Linear Programming. We show that our model significantly outperforms challenging heuristic baselines on two different types of discourse. Moreover, we introduce a novel corpus of persuasive essays annotated with argumentation structures. We show that our annotation scheme and annotation guidelines successfully guide human annotators to substantial agreement.

Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging
Nils Reimers | Iryna Gurevych
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10⁻⁴) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point F₁-score depending on the selected seed value, making these systems perceived either as state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we present network architectures that both produce superior performance and are more stable with respect to the remaining hyperparameters.

Context-Aware Representations for Knowledge Base Relation Extraction
Daniil Sorokin | Iryna Gurevych
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We demonstrate that for sentence-level relation extraction it is beneficial to consider other relations in the sentential context while predicting the target relation. Our architecture uses an LSTM-based encoder to jointly learn representations for all relations in a single sentence. We combine the context representations with an attention mechanism to make the final prediction. We use the Wikidata knowledge base to construct a dataset of multiple relations per sentence and to evaluate our approach. Compared to a baseline system, our method results in an average error reduction of 24% on a held-out set of relations. The code and the dataset to replicate the experiments are made available at https://github.com/ukplab/.

What is the Essence of a Claim? Cross-Domain Claim Identification
Johannes Daxenberger | Steffen Eger | Ivan Habernal | Christian Stab | Iryna Gurevych
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Argument mining has become a popular research area in NLP. It typically includes the identification of argumentative components, e.g. claims, as the central component of an argument. We perform a qualitative analysis across six different datasets and show that these appear to conceptualize claims quite differently. To learn about the consequences of such different conceptualizations of claim for practical applications, we carried out extensive experiments using state-of-the-art feature-rich and deep learning systems, to identify claims in a cross-domain fashion. While the divergent conceptualization of claims in different datasets is indeed harmful to cross-domain classification, we show that there are shared properties on the lexical level as well as system configurations that can help to overcome these gaps.

Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps
Tobias Falke | Iryna Gurevych
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Concept maps can be used to concisely represent important information and bring structure into large document collections. Therefore, we study a variant of multi-document summarization that produces summaries in the form of concept maps. However, suitable evaluation datasets for this task are currently missing. To close this gap, we present a newly created corpus of concept maps that summarize heterogeneous collections of web documents on educational topics. It was created using a novel crowdsourcing approach that allows us to efficiently determine important elements in large document collections. We release the corpus along with a baseline system and proposed evaluation protocol to enable further research on this variant of summarization.

Argotario: Computational Argumentation Meets Serious Games
Ivan Habernal | Raffael Hannemann | Christian Pollak | Christopher Klamm | Patrick Pauli | Iryna Gurevych
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

An important skill in critical thinking and argumentation is the ability to spot and recognize fallacies. Fallacious arguments, omnipresent in argumentative discourse, can be deceptive, manipulative, or simply lead to ‘wrong moves’ in a discussion. Despite their importance, argumentation scholars and NLP researchers with a focus on argumentation quality have not yet investigated fallacies empirically. The nonexistence of resources dealing with fallacious argumentation calls for scalable approaches to data acquisition and annotation, for which the serious games methodology offers an appealing, yet unexplored, alternative. We present Argotario, a serious game that deals with fallacies in everyday argumentation. Argotario is a multilingual, open-source, platform-independent application with strong educational aspects, accessible at www.argotario.net.

GraphDocExplore: A Framework for the Experimental Comparison of Graph-based Document Exploration Techniques
Tobias Falke | Iryna Gurevych
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Graphs have long been proposed as a tool to browse and navigate in a collection of documents in order to support exploratory search. Many techniques to automatically extract different types of graphs, showing for example entities or concepts and different relationships between them, have been suggested. While experimental evidence that they are indeed helpful exists for some of them, it is largely unknown which type of graph is most helpful for a specific exploratory task. However, carrying out experimental comparisons with human subjects is challenging and time-consuming. Towards this end, we present the GraphDocExplore framework. It provides an intuitive web interface for graph-based document exploration that is optimized for experimental user studies. Through a generic graph interface, different methods to extract graphs from text can be plugged into the system. Hence, they can be compared at minimal implementation effort in an environment that ensures controlled comparisons. The system is publicly available under an open-source license.

Concept-Map-Based Multi-Document Summarization using Concept Coreference Resolution and Global Importance Optimization
Tobias Falke | Christian M. Meyer | Iryna Gurevych
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Concept-map-based multi-document summarization is a variant of traditional summarization that produces structured summaries in the form of concept maps. In this work, we propose a new model for the task that addresses several issues in previous methods. It learns to identify and merge coreferent concepts to reduce redundancy, determines their importance with a strong supervised model and finds an optimal summary concept map via integer linear programming. It is also computationally more efficient than previous methods, allowing us to summarize larger document sets. We evaluate the model on two datasets, finding that it outperforms several approaches from previous work.

Neural End-to-End Learning for Computational Argumentation Mining
Steffen Eger | Johannes Daxenberger | Iryna Gurevych
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate neural techniques for end-to-end computational argumentation mining (AM). We frame AM both as a token-based dependency parsing and as a token-based sequence tagging problem, including a multi-task learning setup. Contrary to models that operate on the argument component level, we find that framing AM as dependency parsing leads to subpar performance results. In contrast, less complex (local) tagging models based on BiLSTMs perform robustly across classification scenarios, being able to catch long-range dependencies inherent to the AM problem. Moreover, we find that jointly learning ‘natural’ subtasks, in a multi-task learning setup, improves performance.

Argumentation Quality Assessment: Theory vs. Practice
Henning Wachsmuth | Nona Naderi | Ivan Habernal | Yufang Hou | Graeme Hirst | Iryna Gurevych | Benno Stein
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Argumentation quality is viewed differently in argumentation theory and in practical assessment approaches. This paper studies to what extent the views match empirically. We find that most observations on quality phrased spontaneously are in fact adequately represented by theory. Even more, relative comparisons of arguments in practice correlate with absolute quality ratings based on theory. Our results clarify how the two views can learn from each other.

Integrating Deep Linguistic Features in Factuality Prediction over Unified Datasets
Gabriel Stanovsky | Judith Eckle-Kohler | Yevgeniy Puzikov | Ido Dagan | Iryna Gurevych
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Previous models for the assessment of commitment towards a predicate in a sentence (also known as factuality prediction) were trained and tested against a specific annotated dataset, subsequently limiting the generality of their results. In this work we propose an intuitive method for mapping three previously annotated corpora onto a single factuality scale, thereby enabling models to be tested across these corpora. In addition, we design a novel model for factuality prediction by first extending a previous rule-based factuality prediction system and applying it over an abstraction of dependency trees, and then using the output of this system in a supervised classifier. We show that this model outperforms previous methods on all three datasets. We make both the unified factuality corpus and our new model publicly available.

End-to-End Non-Factoid Question Answering with an Interactive Visualization of Neural Attention Weights
Andreas Rücklé | Iryna Gurevych
Proceedings of ACL 2017, System Demonstrations

pdf bib
SemEval-2017 Task 7: Detection and Interpretation of English Puns
Tristan Miller | Christian Hempelmann | Iryna Gurevych
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

A pun is a form of wordplay in which a word suggests two or more meanings by exploiting polysemy, homonymy, or phonological similarity to another word, for an intended humorous or rhetorical effect. Though a recurrent and expected feature in many discourse types, puns stymie traditional approaches to computational lexical semantics because they violate their one-sense-per-context assumption. This paper describes the first competitive evaluation for the automatic detection, location, and interpretation of puns. We describe the motivation for these tasks, the evaluation methods, and the manually annotated data set. Finally, we present an overview and discussion of the participating systems’ methodologies, resources, and results.

pdf bib
EELECTION at SemEval-2017 Task 10: Ensemble of nEural Learners for kEyphrase ClassificaTION
Steffen Eger | Erik-Lân Do Dinh | Ilia Kuznetsov | Masoud Kiaeeha | Iryna Gurevych
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our approach to the SemEval 2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications, specifically to Subtask (B): Classification of identified keyphrases. We explored three different deep learning approaches: a character-level convolutional neural network (CNN), a stacked learner with an MLP meta-classifier, and an attention based Bi-LSTM. From these approaches, we created an ensemble of differently hyper-parameterized systems, achieving a micro-F1-score of 0.63 on the test data. Our approach ranks 2nd (score of 1st placed system: 0.64) out of four according to this official score. However, we erroneously trained 2 out of 3 neural nets (the stacker and the CNN) on only roughly 15% of the full data, namely, the original development set. When trained on the full data (training+development), our ensemble has a micro-F1-score of 0.69. Our code is available from https://github.com/UKPLab/semeval2017-scienceie.

pdf bib
Real-Time News Summarization with Adaptation to Media Attention
Andreas Rücklé | Iryna Gurevych
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Real-time summarization of news events (RTS) allows persons to stay up-to-date on important topics that develop over time. With the occurrence of major sub-events, media attention increases and a large number of news articles are published. We propose a summarization approach that detects such changes and selects a suitable summarization configuration at run-time. In particular, at times with high media attention, our approach exploits the redundancy in content to produce a more precise summary and avoid emitting redundant information. We find that our approach significantly outperforms a strong non-adaptive RTS baseline in terms of the emitted summary updates and achieves the best results on a recent web-scale dataset. It can successfully be applied to a different real-world dataset without requiring additional modifications.

pdf bib
Assessing SRL Frameworks with Automatic Training Data Expansion
Silvana Hartmann | Éva Mújdricza-Maydt | Ilia Kuznetsov | Iryna Gurevych | Anette Frank
Proceedings of the 11th Linguistic Annotation Workshop

We present the first experiment-based study that explicitly contrasts the three major semantic role labeling frameworks. As a prerequisite, we create a dataset labeled with parallel FrameNet-, PropBank-, and VerbNet-style labels for German. We train a state-of-the-art SRL tool for German for the different annotation styles and provide a comparative analysis across frameworks. We further explore the behavior of the frameworks with automatic training data generation. VerbNet provides larger semantic expressivity than PropBank, and we find that its generalization capacity approaches PropBank in SRL training, but it benefits less from training data expansion than the sparse-data affected FrameNet.

pdf bib
A Consolidated Open Knowledge Representation for Multiple Texts
Rachel Wities | Vered Shwartz | Gabriel Stanovsky | Meni Adler | Ori Shapira | Shyam Upadhyay | Dan Roth | Eugenio Martinez Camara | Iryna Gurevych | Ido Dagan
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

We propose to move from Open Information Extraction (OIE) ahead to Open Knowledge Representation (OKR), aiming to represent information conveyed jointly in a set of texts in an open text-based manner. We do so by consolidating OIE extractions using entity and predicate coreference, while modeling information containment between coreferring elements via lexical entailment. We suggest that generating OKR structures can be a useful step in the NLP pipeline, to give semantic applications an easy handle on consolidated information across multiple texts.

pdf bib
LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test
Michael Bugert | Yevgeniy Puzikov | Andreas Rücklé | Judith Eckle-Kohler | Teresa Martin | Eugenio Martínez-Cámara | Daniil Sorokin | Maxime Peyrard | Iryna Gurevych
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

The Story Cloze test is a recent effort in providing a common test scenario for text understanding systems. As part of the LSDSem 2017 shared task, we present a system based on a deep learning architecture combined with a rich set of manually-crafted linguistic features. The system outperforms all known baselines for the task, suggesting that the chosen approach is promising. We additionally present two methods for generating further training data based on stories from the ROCStories corpus.

pdf bib
Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite
Maria Sukhareva | Francesco Fuscagni | Johannes Daxenberger | Susanne Görke | Doris Prechel | Iryna Gurevych
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents a statistical approach to automatic morphosyntactic annotation of Hittite transcripts. Hittite is an extinct Indo-European language using the cuneiform script. There are currently no morphosyntactic annotations available for Hittite, so we explored methods of distant supervision. The annotations were projected from parallel German translations of the Hittite texts. In order to reduce data sparsity, we applied stemming of German and Hittite texts. As there is no off-the-shelf Hittite stemmer, a stemmer for Hittite was developed for this purpose. The resulting annotation projections were used to train a POS tagger, achieving an accuracy of 69% on a test sample. To our knowledge, this is the first attempt of statistical POS tagging of a cuneiform language.

pdf bib
Prediction of Frame-to-Frame Relations in the FrameNet Hierarchy with Frame Embeddings
Teresa Botschen | Hatem Mousselly-Sergieh | Iryna Gurevych
Proceedings of the 2nd Workshop on Representation Learning for NLP

Automatic completion of frame-to-frame (F2F) relations in the FrameNet (FN) hierarchy has received little attention, although they incorporate meta-level commonsense knowledge and are used in downstream approaches. We address the problem of sparsely annotated F2F relations. First, we examine whether the manually defined F2F relations emerge from text by learning text-based frame embeddings. Our analysis reveals insights about the difficulty of reconstructing F2F relations purely from text. Second, we present different systems for predicting F2F relations; our best-performing one uses the FN hierarchy to train on and to ground embeddings in. A comparison of systems and embeddings exposes the crucial influence of knowledge-based embeddings to a system’s performance in predicting F2F relations.

pdf bib
Learning to Score System Summaries for Better Content Selection Evaluation.
Maxime Peyrard | Teresa Botschen | Iryna Gurevych
Proceedings of the Workshop on New Frontiers in Summarization

The evaluation of summaries is a challenging but crucial task of the summarization field. In this work, we propose to learn an automatic scoring metric based on the human judgements available as part of classical summarization datasets like TAC-2008 and TAC-2009. Any existing automatic scoring metrics can be included as features; the model learns the combination exhibiting the best correlation with human judgments. The reliability of the new metric is tested in a further manual evaluation where we ask humans to evaluate summaries covering the whole scoring spectrum of the metric. We release the trained metric as an open-source tool.

pdf bib
Proceedings of the 4th Workshop on Argument Mining
Ivan Habernal | Iryna Gurevych | Kevin Ashley | Claire Cardie | Nancy Green | Diane Litman | Georgios Petasis | Chris Reed | Noam Slonim | Vern Walker
Proceedings of the 4th Workshop on Argument Mining

pdf bib
Latest News in Computational Argumentation: Surfing on the Deep Learning Wave, Scuba Diving in the Abyss of Fundamental Questions
Iryna Gurevych
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Mining arguments from natural language texts, parsing argumentative structures, and assessing argument quality are among the recent challenges tackled in computational argumentation. While advanced deep learning models provide state-of-the-art performance in many of these tasks, much attention is also paid to the underlying fundamental questions. How are arguments expressed in natural language across genres and domains? What is the essence of an argument’s claim? Can we reliably annotate convincingness of an argument? How can we approach logic and common-sense reasoning in argumentation? This talk highlights some recent advances in computational argumentation and shows why researchers must be both “surfers” and “scuba divers”.

pdf bib
Utilizing Automatic Predicate-Argument Analysis for Concept Map Mining
Tobias Falke | Iryna Gurevych
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

pdf bib
Neural Disambiguation of Causal Lexical Markers Based on Context
Eugenio Martínez-Cámara | Vered Shwartz | Iryna Gurevych | Ido Dagan
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

pdf bib
Representation Learning for Answer Selection with LSTM-Based Importance Weighting
Andreas Rücklé | Iryna Gurevych
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

pdf bib
Out-of-domain FrameNet Semantic Role Labeling
Silvana Hartmann | Ilia Kuznetsov | Teresa Martin | Iryna Gurevych
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Domain dependence of NLP systems is one of the major obstacles to their application in large-scale text analysis, also restricting the applicability of FrameNet semantic role labeling (SRL) systems. Yet, current FrameNet SRL systems are still only evaluated on a single in-domain test set. For the first time, we study the domain dependence of FrameNet SRL on a wide range of benchmark sets. We create a novel test set for FrameNet SRL based on user-generated web text and find that the major bottleneck for out-of-domain FrameNet SRL is the frame identification step. To address this problem, we develop a simple, yet efficient system based on distributed word representations. Our system closely approaches the state-of-the-art in-domain while outperforming the best available frame identification system out-of-domain. We publish our system and test data for research purposes.

pdf bib
Metaheuristic Approaches to Lexical Substitution and Simplification
Sallam Abualhaija | Tristan Miller | Judith Eckle-Kohler | Iryna Gurevych | Karl-Heinz Zimmermann
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we propose using metaheuristics—in particular, simulated annealing and the new D-Bees algorithm—to solve word sense disambiguation as an optimization problem within a knowledge-based lexical substitution system. We are the first to perform such an extrinsic evaluation of metaheuristics, for which we use two standard lexical substitution datasets, one English and one German. We find that D-Bees has robust performance for both languages, and performs better than simulated annealing, though both achieve good results. Moreover, the D-Bees–based lexical substitution system outperforms state-of-the-art systems on several evaluation metrics. We also show that D-Bees achieves competitive performance in lexical simplification, a variant of lexical substitution.

pdf bib
Recognizing Insufficiently Supported Arguments in Argumentative Essays
Christian Stab | Iryna Gurevych
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we propose a new task for assessing the quality of natural language arguments. The premises of a well-reasoned argument should provide enough evidence for accepting or rejecting its claim. Although this criterion, known as sufficiency, is widely adopted in argumentation theory, there are no empirical studies on its applicability to real arguments. In this work, we show that human annotators substantially agree on the sufficiency criterion and introduce a novel annotated corpus. Furthermore, we experiment with feature-rich SVMs and Convolutional Neural Networks and achieve 84% accuracy for automatically identifying insufficiently supported arguments. The final corpus as well as the annotation guideline are freely available for encouraging future research on argument quality.

pdf bib
A tool for extracting sense-disambiguated example sentences through user feedback
Beto Boullosa | Richard Eckart de Castilho | Alexander Geyken | Lothar Lemnitzer | Iryna Gurevych
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

This paper describes an application system aimed to help lexicographers in the extraction of example sentences for a given headword based on its different senses. The tool uses classification and clustering methods and incorporates user feedback to refine its results.

2016

pdf bib
Predicting the Spelling Difficulty of Words for Language Learners
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Token-Level Metaphor Detection using Neural Networks
Erik-Lân Do Dinh | Iryna Gurevych
Proceedings of the Fourth Workshop on Metaphor in NLP

pdf bib
Enriching Wikidata with Frame Semantics
Hatem Mousselly-Sergieh | Iryna Gurevych
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib
Argumentation: Content, Structure, and Relationship with Essay Quality
Beata Beigman Klebanov | Christian Stab | Jill Burstein | Yi Song | Binod Gyawali | Iryna Gurevych
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

pdf bib
Recognizing the Absence of Opposing Arguments in Persuasive Essays
Christian Stab | Iryna Gurevych
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

pdf bib
A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures
Richard Eckart de Castilho | Éva Mújdricza-Maydt | Seid Muhie Yimam | Silvana Hartmann | Iryna Gurevych | Anette Frank | Chris Biemann
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

We introduce the third major release of WebAnno, a generic web-based annotation tool for distributed teams. New features in this release focus on semantic annotation tasks (e.g. semantic role labelling or event annotation) and allow the tight integration of semantic annotations with syntactic annotations. In particular, we introduce the concept of slot features, a novel constraint mechanism that allows modelling the interaction between semantic and syntactic annotations, as well as a new annotation user interface. The new features were developed and used in an annotation project for semantic roles on German texts. The paper briefly introduces this project and reports on experiences performing annotations with the new tool. On a comparative evaluation, our tool reaches significant speedups over WebAnno 2 for a semantic annotation task.

pdf bib
A domain-agnostic approach for opinion prediction on speech
Pedro Bispo Santos | Lisa Beinborn | Iryna Gurevych
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

We explore a domain-agnostic approach for analyzing speech with the goal of opinion prediction. We represent the speech signal by mel-frequency cepstral coefficients and apply long short-term memory neural networks to automatically learn temporal regularities in speech. In contrast to previous work, our approach does not require complex feature engineering and works without textual transcripts. As a consequence, it can easily be applied on various speech analysis tasks for different languages and the results show that it can nevertheless be competitive to the state-of-the-art in opinion prediction. In a detailed error analysis for opinion mining we find that our approach performs well in identifying speaker-specific characteristics, but should be combined with additional information if subtle differences in the linguistic content need to be identified.

pdf bib
Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity
Nils Reimers | Philip Beyer | Iryna Gurevych
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Semantic Textual Similarity (STS) is a foundational NLP task and can be used in a wide range of tasks. To determine the STS of two texts, hundreds of different STS systems exist; however, for an NLP system designer, it is hard to decide which system is the best one. To answer this question, an intrinsic evaluation of the STS systems is conducted by comparing the output of the system to human judgments on semantic similarity. The comparison is usually done using Pearson correlation. In this work, we show that relying on intrinsic evaluations with Pearson correlation can be misleading. In three common STS based tasks we could observe that the Pearson correlation was especially ill-suited to detect the best STS system for the task and other evaluation measures were much better suited. In this work we define how the validity of an intrinsic evaluation can be assessed and compare different intrinsic evaluation methods. Understanding of the properties of the targeted task is crucial and we propose a framework for conducting the intrinsic evaluation which takes the properties of the targeted task into account.

pdf bib
Bridging the gap between extractive and abstractive summaries: Creation and evaluation of coherent extracts from heterogeneous sources
Darina Benikova | Margot Mieskes | Christian M. Meyer | Iryna Gurevych
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Coherent extracts are a novel type of summary combining the advantages of manually created abstractive summaries, which are fluent but difficult to evaluate, and low-quality automatically created extractive summaries, which lack coherence and structure. We use a corpus of heterogeneous documents to address the issue that information seekers usually face – a variety of different types of information sources. We directly extract information from these, but minimally redact and meaningfully order it to form a coherent text. Our qualitative and quantitative evaluations show that quantitative results are not sufficient to judge the quality of a summary and that other quality criteria, such as coherence, should also be taken into account. We find that our manually created corpus is of high quality and that it has the potential to bridge the gap between reference corpora of abstracts and automatic methods producing extracts. Our corpus is available to the research community for further development.

pdf bib
Still not there? Comparing Traditional Sequence-to-Sequence Models to Encoder-Decoder Neural Networks on Monotone String Translation Tasks
Carsten Schnober | Steffen Eger | Erik-Lân Do Dinh | Iryna Gurevych
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We analyze the performance of encoder-decoder neural models and compare them with well-known established methods. The latter represent different classes of traditional approaches that are applied to the monotone sequence-to-sequence tasks OCR post-correction, spelling correction, grapheme-to-phoneme conversion, and lemmatization. Such tasks are of practical relevance for various higher-level research fields including digital humanities, automatic text correction, and speech recognition. We investigate how well generic deep-learning approaches adapt to these tasks, and how they perform in comparison with established and more specialized methods, including our own adaptation of pruned CRFs.

pdf bib
Semi-automatic Detection of Cross-lingual Marketing Blunders based on Pragmatic Label Propagation in Wiktionary
Christian M. Meyer | Judith Eckle-Kohler | Iryna Gurevych
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We introduce the task of detecting cross-lingual marketing blunders, which occur if a trade name resembles an inappropriate or negatively connotated word in a target language. To this end, we suggest a formal task definition and a semi-automatic method based on the propagation of pragmatic labels from Wiktionary across sense-disambiguated translations. Our final tool assists users by providing clues for problematic names in any language, which we simulate in two experiments on detecting previously occurred marketing blunders and identifying relevant clues for established international brands. We conclude the paper with a suggested research roadmap for this new task. To initiate further research, we publish our online demo along with the source code and data at http://uby.ukp.informatik.tu-darmstadt.de/blunder/.

pdf bib
CNN- and LSTM-based Claim Classification in Online User Comments
Chinnappa Guggilla | Tristan Miller | Iryna Gurevych
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

When processing arguments in online user interactive discourse, it is often necessary to determine their bases of support. In this paper, we describe a supervised approach, based on deep neural networks, for classifying the claims made in online arguments. We conduct experiments using convolutional neural networks (CNNs) and long short-term memory networks (LSTMs) on two claim data sets compiled from online user comments. Using different types of distributional word embeddings, but without incorporating any rich, expensive set of features, we achieve a significant improvement over the state of the art for one data set (which categorizes arguments as factual vs. emotional), and performance comparable to the state of the art on the other data set (which categorizes propositions according to their verifiability). Our approach has the advantages of using a generalized, simple, and effective methodology that works for claim categorization on different data sets and tasks.

pdf bib
Modeling Extractive Sentence Intersection via Subtree Entailment
Omer Levy | Ido Dagan | Gabriel Stanovsky | Judith Eckle-Kohler | Iryna Gurevych
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Sentence intersection captures the semantic overlap of two texts, generalizing over paradigms such as textual entailment and semantic text similarity. Despite its modeling power, it has received little attention because it is difficult for non-experts to annotate. We analyze 200 pairs of similar sentences and identify several underlying properties of sentence intersection. We leverage these insights to design an algorithm that decomposes the sentence intersection task into several simpler annotation tasks, facilitating the construction of a high quality dataset via crowdsourcing. We implement this approach and provide an annotated dataset of 1,764 sentence intersections.

pdf bib
Porting an Open Information Extraction System from English to German
Tobias Falke | Gabriel Stanovsky | Iryna Gurevych | Ido Dagan
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in Web argumentation
Ivan Habernal | Iryna Gurevych
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Which argument is more convincing? Analyzing and predicting convincingness of Web arguments using bidirectional LSTM
Ivan Habernal | Iryna Gurevych
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Supersense Embeddings: A Unified Model for Supersense Interpretation, Prediction, and Utilization
Lucie Flekova | Iryna Gurevych
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Temporal Anchoring of Events for the TimeBank Corpus
Nils Reimers | Nazanin Dehghani | Iryna Gurevych
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
MDSWriter: Annotation Tool for Creating High-Quality Multi-Document Summarization Corpora
Christian M. Meyer | Darina Benikova | Margot Mieskes | Iryna Gurevych
Proceedings of ACL-2016 System Demonstrations

bib
NLP Approaches to Computational Argumentation
Noam Slonim | Iryna Gurevych | Chris Reed | Benno Stein
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Argumentation and debating represent primary intellectual activities of the human mind. People in all societies argue and debate, not only to convince others of their own opinions but also in order to explore the differences between multiple perspectives and conceptualizations, and to learn from this exploration. The process of reaching a resolution on controversial topics typically does not follow a simple sequence of purely logical steps. Rather it involves a wide variety of complex and interwoven actions. Presumably, pros and cons are identified, considered, and weighed, via cognitive processes that often involve persuasion and emotions, which are inherently harder to formalize from a computational perspective. This wide range of conceptual capabilities and activities has only in part been studied in fields like CL and NLP, and typically within relatively small sub-communities that overlap the ACL audience. The new field of Computational Argumentation has very recently seen significant expansion within the CL and NLP community as new techniques and datasets start to become available, allowing for the first time investigation of the computational aspects of human argumentation in a holistic manner. The main goal of this tutorial would be to introduce this rapidly evolving field to the CL community. Specifically, we will aim to review recent advances in the field and to outline the challenging research questions - that are most relevant to the ACL audience - that naturally arise when trying to model human argumentation. We will further emphasize the practical value of this line of study, by considering real-world CL and NLP applications that are expected to emerge from this research, and to impact various industries, including legal, finance, healthcare, media, and education, to name just a few examples. The first part of the tutorial will provide an introduction to the basics of argumentation and rhetoric. Next, we will cover fundamental analysis tasks in Computational Argumentation, including argumentation mining, revealing argument relations, assessing argument quality, stance classification, polarity analysis, and more. After the coffee break, we will first review existing resources and recently introduced benchmark data. In the following part we will cover basic synthesis tasks in Computational Argumentation, including the relation to NLG and dialogue systems, and the evolving area of Debate Technologies, defined as technologies developed directly to enhance, support, and engage with human debating. Finally, we will present relevant demos, review potential applications, and discuss the future of this emerging field.

pdf bib
Sense-annotating a Lexical Substitution Data Set with Ubyline
Tristan Miller | Mohamed Khemakhem | Richard Eckart de Castilho | Iryna Gurevych
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe the construction of GLASS, a newly sense-annotated version of the German lexical substitution data set used at the GermEval 2015: LexSub shared task. Using the two annotation layers, we conduct the first known empirical study of the relationship between manually applied word senses and lexical substitutions. We find that synonymy and hypernymy/hyponymy are the only semantic relations directly linking targets to their substitutes, and that substitutes in the target’s hypernymy/hyponymy taxonomy closely align with the synonyms of a single GermaNet synset. Despite this, these substitutes account for a minority of those provided by the annotators. The results of our analysis accord with those of a previous study on English-language data (albeit with automatically induced word senses), leading us to suspect that the sense–substitution relations we discovered may be of a universal nature. We also tentatively conclude that relatively cheap lexical substitution annotations can be used as a knowledge source for automatic WSD. Also introduced in this paper is Ubyline, the web application used to produce the sense annotations. Ubyline presents an intuitive user interface optimized for annotating lexical sample data, and is readily adaptable to sense inventories other than GermaNet.

pdf bib
C4Corpus: Multilingual Web-size Corpus with Free License
Ivan Habernal | Omnia Zayed | Iryna Gurevych
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12 million-page Web corpus (over 10 billion tokens) licensed under the CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs. Our highly-scalable Hadoop-based framework is able to process the full CommonCrawl corpus on a 2000+ CPU cluster on the Amazon Elastic Map/Reduce infrastructure. The processing pipeline includes license identification, state-of-the-art boilerplate removal, exact duplicate and near-duplicate document removal, and language detection. The construction of the corpus is highly configurable and fully reproducible, and we provide both the framework (DKPro C4CorpusTools) and the resulting data (C4Corpus) to the research community.

pdf bib
Crowdsourcing a Large Dataset of Domain-Specific Context-Sensitive Semantic Verb Relations
Maria Sukhareva | Judith Eckle-Kohler | Ivan Habernal | Iryna Gurevych
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new large dataset of 12403 context-sensitive verb relations manually annotated via crowdsourcing. These relations capture fine-grained semantic information between verb-centric propositions, such as temporal or entailment relations. We propose a novel semantic verb relation scheme and design a multi-step annotation approach for scaling up the annotations using crowdsourcing. We employ several quality measures and report on agreement scores. The resulting dataset is available under a permissive CreativeCommons license at www.ukp.tu-darmstadt.de/data/verb-relations/. It represents a valuable resource for various applications, such as automatic information consolidation or automatic summarization.

pdf bib
Combining Semantic Annotation of Word Sense & Semantic Roles: A Novel Annotation Scheme for VerbNet Roles on German Language Data
Éva Mújdricza-Maydt | Silvana Hartmann | Iryna Gurevych | Anette Frank
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a VerbNet-based annotation scheme for semantic roles that we explore in an annotation study on German language data that combines word sense and semantic role annotation. We reannotate a substantial portion of the SALSA corpus with GermaNet senses and a revised scheme of VerbNet roles. We provide a detailed evaluation of the interaction between sense and role annotation. The resulting corpus will allow us to compare VerbNet role annotation for German to FrameNet and PropBank annotation by mapping to existing role annotations on the SALSA corpus. We publish the annotated corpus and detailed guidelines for the new role annotation scheme.

pdf bib
Generating Training Data for Semantic Role Labeling based on Label Transfer from Linked Lexical Resources
Silvana Hartmann | Judith Eckle-Kohler | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 4

We present a new approach for generating role-labeled training data using Linked Lexical Resources, i.e., integrated lexical resources that combine several resources (e.g., WordNet, FrameNet, Wiktionary) by linking them on the sense or on the role level. Unlike resource-based supervision in relation extraction, we focus on complex linguistic annotations, more specifically FrameNet senses and roles. The automatically labeled training data (www.ukp.tu-darmstadt.de/knowledge-based-srl/) are evaluated on four corpora from different domains for the tasks of word sense disambiguation and semantic role classification. Results show that classifiers trained on our generated data equal those resulting from a standard supervised setting.

2015

pdf bib
Linking the Thoughts: Analysis of Argumentation Structures in Scientific Publications
Christian Kirschner | Judith Eckle-Kohler | Iryna Gurevych
Proceedings of the 2nd Workshop on Argumentation Mining

pdf bib
Candidate evaluation strategies for improved difficulty prediction of language tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Counting What Counts: Decompounding for Keyphrase Extraction
Nicolai Erbs | Pedro Bispo Santos | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

pdf bib
Automatic disambiguation of English puns
Tristan Miller | Iryna Gurevych
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
In-tool Learning for Selective Manual Annotation in Large Corpora
Erik-Lân Do Dinh | Richard Eckart de Castilho | Iryna Gurevych
Proceedings of ACL-IJCNLP 2015 System Demonstrations

pdf bib
Noise or additional information? Leveraging crowdsource annotation item agreement for natural language tasks.
Emily Jamison | Iryna Gurevych
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Personality Profiling of Fictional Characters using Sense-Level Links between Lexical Resources
Lucie Flekova | Iryna Gurevych
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Exploiting Debate Portals for Semi-Supervised Argumentation Mining in User-Generated Web Discourse
Ivan Habernal | Iryna Gurevych
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
On the Role of Discourse Markers for Discriminating Claims and Premises in Argumentative Discourse
Judith Eckle-Kohler | Roland Kluge | Iryna Gurevych
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
WordNet—Wikipedia—Wiktionary: Construction of a Three-way Alignment
Tristan Miller | Iryna Gurevych
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The coverage and quality of conceptual information contained in lexical semantic resources is crucial for many tasks in natural language processing. Automatic alignment of complementary resources is one way of improving this coverage and quality; however, past attempts have always been between pairs of specific resources. In this paper we establish some set-theoretic conventions for describing concepts and their alignments, and use them to describe a method for automatically constructing n-way alignments from arbitrary pairwise alignments. We apply this technique to the production of a three-way alignment from previously published WordNet-Wikipedia and WordNet-Wiktionary alignments. We then present a quantitative and informal qualitative analysis of the aligned resource. The three-way alignment was found to have greater coverage, an enriched sense representation, and coarser sense granularity than both the original resources and their pairwise alignments, though this came at the cost of accuracy. An evaluation of the induced word sense clusters in a word sense disambiguation task showed that they were no better than random clusters of equivalent granularity. However, use of the alignments to enrich a sense inventory with additional sense glosses did significantly improve the performance of a baseline knowledge-based WSD algorithm.

pdf bib
Lexical Substitution Dataset for German
Kostadin Cholakov | Chris Biemann | Judith Eckle-Kohler | Iryna Gurevych
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article describes a lexical substitution dataset for German. The whole dataset contains 2,040 sentences from the German Wikipedia, with one target word in each sentence. There are 51 target nouns, 51 adjectives, and 51 verbs randomly selected from 3 frequency groups based on the lemma frequency list of the German WaCKy corpus. 200 sentences have been annotated by 4 professional annotators and the remaining sentences by 1 professional annotator and 5 additional annotators who have been recruited via crowdsourcing. The resulting dataset can be used to evaluate not only lexical substitution systems, but also different sense inventories and word sense disambiguation systems.

pdf bib
A broad-coverage collection of portable NLP components for building shareable analysis pipelines
Richard Eckart de Castilho | Iryna Gurevych
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

pdf bib
Automatically Detecting Corresponding Edit-Turn-Pairs in Wikipedia
Johannes Daxenberger | Iryna Gurevych
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
DKPro Keyphrases: Flexible and Reusable Keyphrase Extraction Experiments
Nicolai Erbs | Pedro Bispo Santos | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data
Johannes Daxenberger | Oliver Ferschke | Iryna Gurevych | Torsten Zesch
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno
Seid Muhie Yimam | Chris Biemann | Richard Eckart de Castilho | Iryna Gurevych
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Needle in a Haystack: Reducing the Costs of Annotating Rare-Class Instances in Imbalanced Datasets
Emily Jamison | Iryna Gurevych
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

pdf bib
Adjacency Pair Recognition in Wikipedia Discussions using Lexical Pairs
Emily Jamison | Iryna Gurevych
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

pdf bib
Sense and Similarity: A Study of Sense-level Similarity Measures
Nicolai Erbs | Iryna Gurevych | Torsten Zesch
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf bib
UKPDIPF: Lexical Semantic Approach to Sentiment Polarity Prediction in Twitter Data
Lucie Flekova | Oliver Ferschke | Iryna Gurevych
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
High Performance Word Sense Alignment by Joint Modeling of Sense Distance and Gloss Similarity
Michael Matuschek | Iryna Gurevych
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Annotating Argument Components and Relations in Persuasive Essays
Christian Stab | Iryna Gurevych
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
DKPro Agreement: An Open-Source Java Library for Measuring Inter-Rater Agreement
Christian M. Meyer | Margot Mieskes | Christian Stab | Iryna Gurevych
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

pdf bib
Predicting the Difficulty of Language Proficiency Tests
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 2

Language proficiency tests are used to evaluate and compare the progress of language learners. We present an approach for automatic difficulty prediction of C-tests that performs on par with human experts. On the basis of detailed analysis of newly collected data, we develop a model for C-test difficulty introducing four dimensions: solution difficulty, candidate ambiguity, inter-gap dependency, and paragraph difficulty. We show that cues from all four dimensions contribute to C-test difficulty.

pdf bib
Automated Verb Sense Labelling Based on Linked Lexical Resources
Kostadin Cholakov | Judith Eckle-Kohler | Iryna Gurevych
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Identifying Argumentative Discourse Structures in Persuasive Essays
Christian Stab | Iryna Gurevych
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
Automatically Classifying Edit Categories in Wikipedia Revisions
Johannes Daxenberger | Iryna Gurevych
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
UKP-WSI: UKP Lab Semeval-2013 Task 11 System Description
Hans-Peter Zorn | Iryna Gurevych
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
UKP-BIU: Similarity and Entailment Metrics for Student Response Analysis
Omer Levy | Torsten Zesch | Ido Dagan | Iryna Gurevych
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
Hierarchy Identification for Automatically Generating Table-of-Contents
Nicolai Erbs | Iryna Gurevych | Torsten Zesch
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Headerless, Quoteless, but not Hopeless? Using Pairwise Email Classification to Disentangle Email Threads
Emily Jamison | Iryna Gurevych
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Supervised All-Words Lexical Substitution using Delexicalized Features
György Szarvas | Chris Biemann | Iryna Gurevych
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment
Michael Matuschek | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 1

In this paper, we present Dijkstra-WSA, a novel graph-based algorithm for word sense alignment. We evaluate it on four different pairs of lexical-semantic resources with different characteristics (WordNet-OmegaWiki, WordNet-Wiktionary, GermaNet-Wiktionary and WordNet-Wikipedia) and show that it achieves competitive performance on 3 out of 4 datasets. Dijkstra-WSA outperforms the state of the art on every dataset if it is combined with a back-off based on gloss similarity. We also demonstrate that Dijkstra-WSA is not only flexibly applicable to different resources but also highly parameterizable to optimize for precision or recall.

pdf bib
Cognate Production using Character-based Machine Translation
Lisa Beinborn | Torsten Zesch | Iryna Gurevych
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Uncertainty Detection for Natural Language Watermarking
György Szarvas | Iryna Gurevych
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia
Oliver Ferschke | Iryna Gurevych | Marc Rittberger
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
Silvana Hartmann | Iryna Gurevych
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Recognizing Partial Textual Entailment
Omer Levy | Torsten Zesch | Ido Dagan | Iryna Gurevych
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
Seid Muhie Yimam | Iryna Gurevych | Richard Eckart de Castilho | Chris Biemann
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
Tristan Miller | Nicolai Erbs | Hans-Peter Zorn | Torsten Zesch | Iryna Gurevych
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
DKPro Similarity: An Open Source Framework for Text Similarity
Daniel Bär | Torsten Zesch | Iryna Gurevych
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

pdf bib
CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
Richard Eckart de Castilho | Sabine Bartsch | Iryna Gurevych
Proceedings of the ACL 2012 System Demonstrations

pdf bib
Cross-Genre and Cross-Domain Detection of Semantic Uncertainty
György Szarvas | Veronika Vincze | Richárd Farkas | György Móra | Iryna Gurevych
Computational Linguistics, Volume 38, Issue 2 - June 2012

pdf bib
Proceedings of the 3rd Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP
Iryna Gurevych | Nicoletta Calzolari Zamorani | Jungi Kim
Proceedings of the 3rd Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP

pdf bib
Text Reuse Detection using a Composition of Text Similarity Measures
Daniel Bär | Torsten Zesch | Iryna Gurevych
Proceedings of COLING 2012

pdf bib
A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia Articles
Johannes Daxenberger | Iryna Gurevych
Proceedings of COLING 2012

pdf bib
To Exhibit is not to Loiter: A Multilingual, Sense-Disambiguated Wiktionary for Measuring Verb Similarity
Christian M. Meyer | Iryna Gurevych
Proceedings of COLING 2012

pdf bib
Using Distributional Similarity for Lexical Expansion in Knowledge-based Word Sense Disambiguation
Tristan Miller | Chris Biemann | Torsten Zesch | Iryna Gurevych
Proceedings of COLING 2012

pdf bib
Learning Semantics with Deep Belief Network for Cross-Language Information Retrieval
Jungi Kim | Jinseok Nam | Iryna Gurevych
Proceedings of COLING 2012: Posters

pdf bib
Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoperability
Judith Eckle-Kohler | Iryna Gurevych
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF
Iryna Gurevych | Judith Eckle-Kohler | Silvana Hartmann | Michael Matuschek | Christian M. Meyer | Christian Wirth
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages
Oliver Ferschke | Iryna Gurevych | Yevgen Chebotar
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures
Daniel Bär | Chris Biemann | Iryna Gurevych | Torsten Zesch
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
UBY-LMF – A Uniform Model for Standardizing Heterogeneous Lexical-Semantic Resources in ISO-LMF
Judith Eckle-Kohler | Iryna Gurevych | Silvana Hartmann | Michael Matuschek | Christian M. Meyer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present UBY-LMF, an LMF-based model for large-scale, heterogeneous multilingual lexical-semantic resources (LSRs). UBY-LMF allows the standardization of LSRs down to a fine-grained level of lexical information by employing a large number of Data Categories from ISOCat. We evaluate UBY-LMF by converting nine LSRs in two languages to the corresponding format: the English WordNet, Wiktionary, Wikipedia, OmegaWiki, FrameNet and VerbNet and the German Wikipedia, Wiktionary and GermaNet. The resulting LSR, UBY (Gurevych et al., 2012), holds interoperable versions of all nine resources, which can be queried through an easy-to-use public Java API. UBY-LMF covers a wide range of information types from expert-constructed and collaboratively constructed resources for English and German, also including links between different resources at the word sense level. It is designed to accommodate further resources and languages as well as automatically mined lexical-semantic knowledge.

pdf bib
The Open Linguistics Working Group
Christian Chiarcos | Sebastian Hellmann | Sebastian Nordhoff | Steven Moran | Richard Littauer | Judith Eckle-Kohler | Iryna Gurevych | Silvana Hartmann | Michael Matuschek | Christian M. Meyer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation (OKFN). The OWLG is an initiative concerned with linguistic data by scholars from diverse fields, including linguistics, NLP, and information science. The primary goal of the working group is to promote the idea of open linguistic resources, to develop means for their representation and to encourage the exchange of ideas across different disciplines. This paper summarizes the progress of the working group, goals that have been identified, problems that we are going to address, and recent activities and ongoing developments. Here, we put particular emphasis on the development of a Linked Open Data (sub-)cloud of linguistic resources that is currently being pursued by several OWLG members.

2011

pdf bib
Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
Daniel Bär | Nicolai Erbs | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL-HLT 2011 System Demonstrations

pdf bib
Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History
Oliver Ferschke | Torsten Zesch | Iryna Gurevych
Proceedings of the ACL-HLT 2011 System Demonstrations

pdf bib
What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage
Christian M. Meyer | Iryna Gurevych
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
A Reflective View on Text Similarity
Daniel Bär | Torsten Zesch | Iryna Gurevych
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

pdf bib
The People’s Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet
Elisabeth Niemann | Iryna Gurevych
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

pdf bib
A Study of Sense-Disambiguated Networks Induced from Folksonomies
Hans-Peter Zorn | Iryna Gurevych
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

2010

pdf bib
The More the Better? Assessing the Influence of Wikipedia’s Growth on Semantic Relatedness Measures
Torsten Zesch | Iryna Gurevych
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Wikipedia has been used as a knowledge source in many areas of natural language processing. As most studies only use a certain Wikipedia snapshot, the influence of Wikipedia’s massive growth on the results is largely unknown. For the first time, we perform an in-depth analysis of this influence using semantic relatedness as an example application that tests a wide range of Wikipedia’s properties. We find that the growth of Wikipedia has almost no effect on the correlation of semantic relatedness measures with human judgments, while the coverage steadily increases.

pdf bib
Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields
Niklas Jakob | Iryna Gurevych
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Monolingual Tree-based Translation Model for Sentence Simplification
Zhemin Zhu | Delphine Bernhard | Iryna Gurevych
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
TUD: Semantic Relatedness for Relation Classification
György Szarvas | Iryna Gurevych
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Sentence and Expression Level Annotation of Opinions in User-Generated Discourse
Cigdem Toprak | Niklas Jakob | Iryna Gurevych
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews
Niklas Jakob | Iryna Gurevych
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources
Iryna Gurevych | Torsten Zesch
Proceedings of the 2nd Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources

2009

pdf bib
Combining Lexical Semantic Resources with Question & Answer Archives for Translation-Based Answer Finding
Delphine Bernhard | Iryna Gurevych
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
A Study on the Semantic Relatedness of Query and Document Terms in Information Retrieval
Christof Müller | Iryna Gurevych
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)
Iryna Gurevych | Torsten Zesch
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

pdf bib
Approximate Matching for Evaluating Keyphrase Extraction
Torsten Zesch | Iryna Gurevych
Proceedings of the International Conference RANLP-2009

2008

pdf bib
Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary
Torsten Zesch | Christof Müller | Iryna Gurevych
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been discovered as valuable lexical semantic knowledge bases with a high potential in diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases however significantly differ from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment for research in NLP. This paper addresses one such major impediment, namely the lack of suitable programmatic access mechanisms to the knowledge stored in these large semantic knowledge bases. We present two application programming interfaces for Wikipedia and Wiktionary which are especially designed for mining the rich lexical semantic information dispersed in the knowledge bases, and provide efficient and structured access to the available knowledge. As we believe them to be of general interest to the NLP community, we have made them freely available for research purposes.

pdf bib
Answering Learners’ Questions by Retrieving Question Paraphrases from Social Q&A Sites
Delphine Bernhard | Iryna Gurevych
Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Coling 2008: Educational Natural Language Processing - Tutorial notes
Iryna Gurevych | Delphine Bernhard
Coling 2008: Educational Natural Language Processing - Tutorial notes

pdf bib
ENLP Tutorial Notes – Slides
Iryna Gurevych | Delphine Bernhard
Coling 2008: Educational Natural Language Processing - Tutorial notes

pdf bib
ENLP Tutorial Notes – References
Iryna Gurevych | Delphine Bernhard
Coling 2008: Educational Natural Language Processing - Tutorial notes

2007

pdf bib
What to be? - Electronic Career Guidance Based on Semantic Relatedness
Iryna Gurevych | Christof Müller | Torsten Zesch
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf bib
Automatically Assessing the Post Quality in Online Discussions on Software
Markus Weimer | Iryna Gurevych | Max Mühlhäuser
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf bib
Comparing Wikipedia and German Wordnet by Evaluating Semantic Relatedness on Multiple Datasets
Torsten Zesch | Iryna Gurevych | Max Mühlhäuser
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf bib
Cross-Lingual Distributional Profiles of Concepts for Measuring Semantic Distance
Saif Mohammad | Iryna Gurevych | Graeme Hirst | Torsten Zesch
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf bib
Analysis of the Wikipedia Category Graph for NLP Applications
Torsten Zesch | Iryna Gurevych
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing

2006

pdf bib
Automatically Creating Datasets for Measures of Semantic Relatedness
Torsten Zesch | Iryna Gurevych
Proceedings of the Workshop on Linguistic Distances

2005

pdf bib
Accessing GermaNet Data and Computing Semantic Relatedness
Iryna Gurevych | Hendrik Niederlich
Proceedings of the ACL Interactive Poster and Demonstration Sessions

pdf bib
Using the Structure of a Conceptual Network in Computing Semantic Relatedness
Iryna Gurevych
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Computing Semantic Relatedness in German with Revised Information Content Metrics
Iryna Gurevych | Hendrik Niederlich
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources

2004

pdf bib
Semantic Similarity Applied to Spoken Dialogue Summarization
Iryna Gurevych | Michael Strube
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Assigning Domains to Speech Recognition Hypotheses
Klaus Rüggenmann | Iryna Gurevych
Proceedings of the HLT-NAACL 2004 Workshop on Spoken Language Understanding for Conversational Systems and Higher Level Linguistic Information for Speech Processing

2003

pdf bib
Semantic Coherence Scoring Using an Ontology
Iryna Gurevych | Rainer Malaka | Robert Porzel | Hans-Peter Zorn
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Automatic Creation of Interface Specifications from Ontologies
Iryna Gurevych | Stefan Merten | Robert Porzel
Proceedings of the HLT-NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS)

pdf bib
Less is More: Using a single knowledge representation in dialogue systems
Iryna Gurevych | Robert Porzel | Elena Slinko | Norbert Pfleger | Jan Alexandersson | Stefan Merten
Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning

pdf bib
Ontology-based Contextual Coherence Scoring
Robert Porzel | Iryna Gurevych | Christof E. Müller
Proceedings of the Fourth SIGdial Workshop of Discourse and Dialogue

2002

pdf bib
Annotating the Semantic Consistency of Speech Recognition Hypotheses
Iryna Gurevych | Robert Porzel | Michael Strube
Proceedings of the Third SIGdial Workshop on Discourse and Dialogue

pdf bib
Towards Context-adaptive Utterance Interpretation
Robert Porzel | Iryna Gurevych
Proceedings of the Third SIGdial Workshop on Discourse and Dialogue
