Michael Roth

2024

pdf abs
A Diachronic Analysis of Gender-Neutral Language on wikiHow
Katharina Suhr | Michael Roth
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

As a large how-to website, wikiHow’s mission is to empower every person on the planet to learn how to do anything. An important part of including everyone also linguistically is the use of gender-neutral language. In this short paper, we study in how far articles from wikiHow fulfill this criterion based on manual annotation and automatic classification. In particular, we employ a classifier to analyze how the use of gender-neutral language has developed over time. Our results show that although about 75% of all articles on wikiHow were written in a gender-neutral way from the outset, revisions have a higher tendency to add gender-specific language than to change it to inclusive wording.

2023

pdf abs
How-to Guides for Specific Audiences: A Corpus and Initial Findings
Nicola Fanton | Agnieszka Falenska | Michael Roth
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Instructional texts for specific target groups should ideally take into account the prior knowledge and needs of the readers in order to guide them efficiently to their desired goals. However, targeting specific groups also carries the risk of reflecting disparate social norms and subtle stereotypes. In this paper, we investigate the extent to which how-to guides from one particular platform, wikiHow, differ in practice depending on the intended audience. We conduct two case studies in which we examine qualitative features of texts written for specific audiences. In a generalization study, we investigate which differences can also be systematically demonstrated using computational methods. The results of our studies show that guides from wikiHow, like other text genres, are subject to subtle biases. We aim to raise awareness of these inequalities as a first step to addressing them in future work.

2022

pdf abs
Toward Implicit Reference in Dialog: A Survey of Methods and Data
Lindsey Vanderlyn | Talita Anthonio | Daniel Ortega | Michael Roth | Ngoc Thang Vu
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Communicating efficiently in natural language requires that we often leave information implicit, especially in spontaneous speech. This frequently results in phenomena of incompleteness, such as omitted references, that pose challenges for language processing. In this survey paper, we review the state of the art in research regarding the automatic processing of such implicit references in dialog scenarios, discuss weaknesses with respect to inconsistencies in task definitions and terminologies, and outline directions for future work. Among others, these include a unification of existing tasks, addressing data scarcity, and taking into account model and annotator uncertainties.

pdf abs
Clarifying Implicit and Underspecified Phrases in Instructional Text
Talita Anthonio | Anna Sauer | Michael Roth
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Natural language inherently consists of implicit and underspecified phrases, which represent potential sources of misunderstanding. In this paper, we present a data set of such phrases in English from instructional texts together with multiple possible clarifications. Our data set, henceforth called CLAIRE, is based on a corpus of revision histories from wikiHow, from which we extract human clarifications that resolve an implicit or underspecified phrase. We show how language modeling can be used to generate alternate clarifications, which may or may not be compatible with the human clarification. Based on plausibility judgements for each clarification, we define the task of distinguishing between plausible and implausible clarifications. We provide several baseline models for this task and analyze to what extent different clarifications represent multiple readings as a first step to investigate misunderstandings caused by implicit/underspecified language in instructional texts.

pdf abs
SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts
Michael Roth | Talita Anthonio | Anna Sauer
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We describe SemEval-2022 Task 7, a shared task on rating the plausibility of clarifications in instructional texts. The dataset for this task consists of manually clarified how-to guides for which we generated alternative clarifications and collected human plausibility judgements. The task of participating systems was to automatically determine the plausibility of a clarification in the respective context. In total, 21 participants took part in this task, with the best system achieving an accuracy of 68.9%. This report summarizes the results and findings from 8 teams and their system descriptions. Finally, we show in an additional evaluation that predictions by the top participating team make it possible to identify contexts with multiple plausible clarifications with an accuracy of 75.2%.

2021

pdf abs
Resolving Implicit References in Instructional Texts
Talita Anthonio | Michael Roth
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

The usage of (co-)referring expressions in discourse contributes to the coherence of a text. However, text comprehension can be difficult when referring expressions are non-verbalized and have to be resolved in the discourse context. In this paper, we propose a novel dataset of such implicit references, which we automatically derive from insertions of references in collaboratively edited how-to guides. Our dataset consists of 6,014 instances, making it one of the largest datasets of implicit references and a useful starting point to investigate misunderstandings caused by underspecified language. We test different methods for resolving implicit references in our dataset based on the Generative Pre-trained Transformer model (GPT) and compare them to heuristic baselines. Our experiments indicate that GPT can accurately resolve the majority of implicit references in our data. Finally, we investigate remaining errors and examine human preferences regarding different resolutions of an implicit reference given the discourse context.

pdf abs
A Computational Analysis of Vagueness in Revisions of Instructional Texts
Alok Debnath | Michael Roth
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

WikiHow is an open-domain repository of instructional articles for a variety of tasks, which can be revised by users. In this paper, we extract pairwise versions of an instruction before and after a revision was made. Starting from a noisy dataset of revision histories, we specifically extract and analyze edits that involve cases of vagueness in instructions. We further investigate the ability of a neural model to distinguish between two versions of an instruction in our data by adopting a pairwise ranking task from previous work and showing improvements over existing baselines.

pdf bib
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language
Michael Roth | Reut Tsarfaty | Yoav Goldberg
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language

pdf abs
UnImplicit Shared Task Report: Detecting Clarification Requirements in Instructional Text
Michael Roth | Talita Anthonio
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language

This paper describes the data, task setup, and results of the shared task at the First Workshop on Understanding Implicit and Underspecified Language (UnImplicit). The task requires computational models to predict whether a sentence contains aspects of meaning that are contextually unspecified and thus require clarification. Two teams participated and the best scoring system achieved an accuracy of 68%.

2020

pdf abs
Predicting Coreference in Abstract Meaning Representations
Tatiana Anikina | Alexander Koller | Michael Roth
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

This work addresses coreference resolution in Abstract Meaning Representation (AMR) graphs, a popular formalism for semantic parsing. We evaluate several current coreference resolution techniques on a recently published AMR coreference corpus, establishing baselines for future work. We also demonstrate that coreference resolution can improve the accuracy of a state-of-the-art semantic parser on this corpus.

pdf abs
Towards Modeling Revision Requirements in wikiHow Instructions
Irshad Bhat | Talita Anthonio | Michael Roth
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

wikiHow is a resource of how-to guidesthat describe the steps necessary to accomplish a goal. Guides in this resource are regularly edited by a community of users, who try to improve instructions in terms of style, clarity and correctness. In this work, we test whether the need for such edits can be predicted automatically. For this task, we extend an existing resource of textual edits with a complementary set of approx. 4 million sentences that remain unedited over time and report on the outcome of two revision modeling experiments.

pdf abs
wikiHowToImprove: A Resource and Analyses on Edits in Instructional Texts
Talita Anthonio | Irshad Bhat | Michael Roth
Proceedings of the Twelfth Language Resources and Evaluation Conference

Instructional texts, such as articles in wikiHow, describe the actions necessary to accomplish a certain goal. In wikiHow and other resources, such instructions are subject to revision edits on a regular basis. Do these edits improve instructions only in terms of style and correctness, or do they provide clarifications necessary to follow the instructions and to accomplish the goal? We describe a resource and first studies towards answering this question. Specifically, we create wikiHowToImprove, a collection of revision histories for about 2.7 million sentences from about 246000 wikiHow articles. We describe human annotation studies on categorizing a subset of sentence-level edits and provide baseline models for the task of automatically distinguishing “older” from “newer” revisions of a sentence.

pdf abs
What Can We Learn from Noun Substitutions in Revision Histories?
Talita Anthonio | Michael Roth
Proceedings of the 28th International Conference on Computational Linguistics

In community-edited resources such as wikiHow, sentences are subject to revisions on a daily basis. Recent work has shown that resulting improvements over time can be modelled computationally, assuming that each revision contributes to the improvement. We take a closer look at a subset of such revisions, for which we attempt to improve a computational model and validate in how far the assumption that ‘revised means better’ actually holds. The subset of revisions considered here are noun substitutions, which often involve interesting semantic relations, including synonymy, antonymy and hypernymy. Despite the high semantic relatedness, we find that a supervised classifier can distinguish the revised version of a sentence from an original version with an accuracy close to 70%, when taking context into account. In a human annotation study, we observe that annotators identify the revised sentence as the ‘better version’ with similar performance. Our analysis reveals a fair agreement among annotators when a revision improves fluency. In contrast, noun substitutions that involve other lexical-semantic relationships are often perceived as being equally good or tend to cause disagreements. While these findings are also reflected in classification scores, a comparison of results shows that our model fails in cases where humans can resort to factual knowledge or intuitions about the required level of specificity.

2019

pdf abs
MCScript2.0: A Machine Comprehension Corpus Focused on Script Events and Participants
Simon Ostermann | Michael Roth | Manfred Pinkal
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We introduce MCScript2.0, a machine comprehension corpus for the end-to-end evaluation of script knowledge. MCScript2.0 contains approx. 20,000 questions on approx. 3,500 texts, crowdsourced based on a new collection process that results in challenging questions. Half of the questions cannot be answered from the reading texts, but require the use of commonsense and, in particular, script knowledge. We give a thorough analysis of our corpus and show that while the task is not challenging to humans, existing machine comprehension models fail to perform well on the data, even if they make use of a commonsense knowledge base. The dataset is available at http://www.sfb1102.uni-saarland.de/?page_id=2582

pdf abs
Combining Discourse Markers and Cross-lingual Embeddings for Synonym–Antonym Classification
Michael Roth | Shyam Upadhyay
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

It is well-known that distributional semantic approaches have difficulty in distinguishing between synonyms and antonyms (Grefenstette, 1992; Padó and Lapata, 2003). Recent work has shown that supervision available in English for this task (e.g., lexical resources) can be transferred to other languages via cross-lingual word embeddings. However, this kind of transfer misses monolingual distributional information available in a target language, such as contrast relations that are indicative of antonymy (e.g. hot ... while ... cold). In this work, we improve the transfer by exploiting monolingual information, expressed in the form of co-occurrences with discourse markers that convey contrast. Our approach makes use of less than a dozen markers, which can easily be obtained for many languages. Compared to a baseline using only cross-lingual embeddings, we show absolute improvements of 4–10% F1-score in Vietnamese and Hindi.

pdf bib
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing
Simon Ostermann | Sheng Zhang | Michael Roth | Peter Clark
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

pdf abs
Commonsense Inference in Natural Language Processing (COIN) - Shared Task Report
Simon Ostermann | Sheng Zhang | Michael Roth | Peter Clark
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

This paper reports on the results of the shared tasks of the COIN workshop at EMNLP-IJCNLP 2019. The tasks consisted of two machine comprehension evaluations, each of which tested a system’s ability to answer questions/queries about a text. Both evaluations were designed such that systems need to exploit commonsense knowledge, for example, in the form of inferences over information that is available in the common ground but not necessarily mentioned in the text. A total of five participating teams submitted systems for the shared tasks, with the best submitted system achieving 90.6% accuracy and 83.7% F1-score on task 1 and task 2, respectively.

pdf abs
Detecting Everyday Scenarios in Narrative Texts
Lilian Diana Awuor Wanzare | Michael Roth | Manfred Pinkal
Proceedings of the Second Workshop on Storytelling

Script knowledge consists of detailed information on everyday activities. Such information is often taken for granted in text and needs to be inferred by readers. Therefore, script knowledge is a central component to language comprehension. Previous work on representing scripts is mostly based on extensive manual work or limited to scenarios that can be found with sufficient redundancy in large corpora. We introduce the task of scenario detection, in which we identify references to scripts. In this task, we address a wide range of different scripts (200 scenarios) and we attempt to identify all references to them in a collection of narrative texts. We present a first benchmark data set and a baseline model that tackles scenario detection using techniques from topic segmentation and text classification.

2018

pdf
MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge
Simon Ostermann | Ashutosh Modi | Michael Roth | Stefan Thater | Manfred Pinkal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf abs
SemEval-2018 Task 11: Machine Comprehension Using Commonsense Knowledge
Simon Ostermann | Michael Roth | Ashutosh Modi | Stefan Thater | Manfred Pinkal
Proceedings of the 12th International Workshop on Semantic Evaluation

This report summarizes the results of the SemEval 2018 task on machine comprehension using commonsense knowledge. For this machine comprehension task, we created a new corpus, MCScript. It contains a high number of questions that require commonsense knowledge for finding the correct answer. 11 teams from 4 different countries participated in this shared task, most of them used neural approaches. The best performing system achieves an accuracy of 83.95%, outperforming the baselines by a large margin, but still far from the human upper bound, which was found to be at 98%.

pdf abs
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences
Daniel Khashabi | Snigdha Chaturvedi | Michael Roth | Shyam Upadhyay | Dan Roth
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. We solicit and verify questions and answers for this challenge through a 4-step crowdsourcing experiment. Our challenge dataset contains 6,500+ questions for 1000+ paragraphs across 7 different domains (elementary school science, news, travel guides, fiction stories, etc) bringing in linguistic diversity to the texts and to the questions wordings. On a subset of our dataset, we found human solvers to achieve an F1-score of 88.1%. We analyze a range of baselines, including a recent state-of-art reading comprehension system, and demonstrate the difficulty of this challenge, despite a high human performance. The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills.

2017

pdf bib
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics
Michael Roth | Nasrin Mostafazadeh | Nathanael Chambers | Annie Louis
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

pdf abs
LSDSem 2017 Shared Task: The Story Cloze Test
Nasrin Mostafazadeh | Michael Roth | Annie Louis | Nathanael Chambers | James Allen
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

The LSDSem’17 shared task is the Story Cloze Test, a new evaluation for story understanding and script learning. This test provides a system with a four-sentence story and two possible endings, and the system must choose the correct ending to the story. Successful narrative understanding (getting closer to human performance of 100%) requires systems to link various levels of semantics to commonsense knowledge. A total of eight systems participated in the shared task, with a variety of approaches including.

pdf
Role Semantics for Better Models of Implicit Discourse Relations
Michael Roth
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

pdf abs
Semantic Role Labeling
Diego Marcheggiani | Michael Roth | Ivan Titov | Benjamin Van Durme
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial describes semantic role labelling (SRL), the task of mapping text to shallow semantic representations of eventualities and their participants. The tutorial introduces the SRL task and discusses recent research directions related to the task. The audience of this tutorial will learn about the linguistic background and motivation for semantic roles, and also about a range of computational models for this task, from early approaches to the current state-of-the-art. We will further discuss recently proposed variations to the traditional SRL task, including topics such as semantic proto-role labeling.We also cover techniques for reducing required annotation effort, such as methods exploiting unlabeled corpora (semi-supervised and unsupervised techniques), model adaptation across languages and domains, and methods for crowdsourcing semantic role annotation (e.g., question-answer driven SRL). Methods based on different machine learning paradigms, including neural networks, generative Bayesian models, graph-based algorithms and bootstrapping style techniques.Beyond sentence-level SRL, we discuss work that involves semantic roles in discourse. In particular, we cover data sets and models related to the task of identifying implicit roles and linking them to discourse antecedents. We introduce different approaches to this task from the literature, including models based on coreference resolution, centering, and selectional preferences. We also review how new insights gained through them can be useful for the traditional SRL task.

pdf abs
Aligning Script Events with Narrative Texts
Simon Ostermann | Michael Roth | Stefan Thater | Manfred Pinkal
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Script knowledge plays a central role in text understanding and is relevant for a variety of downstream tasks. In this paper, we consider two recent datasets which provide a rich and general representation of script events in terms of paraphrase sets. We introduce the task of mapping event mentions in narrative texts to such script event types, and present a model for this task that exploits rich linguistic representations as well as information on temporal ordering. The results of our experiments demonstrate that this complex task is indeed feasible.

2016

pdf bib
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods
Annie Louis | Michael Roth | Bonnie Webber | Michael White | Luke Zettlemoyer
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

pdf
Neural Semantic Role Labeling with Dependency Path Embeddings
Michael Roth | Mirella Lapata
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf abs
Context-aware Frame-Semantic Role Labeling
Michael Roth | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 3

Frame semantic representations have been useful in several applications ranging from text-to-scene generation, to question answering and social network analysis. Predicting such representations from raw text is, however, a challenging task and corresponding models are typically only trained on a small set of sentence-level annotations. In this paper, we present a semantic role labeling system that takes into account sentence and discourse context. We introduce several new features which we motivate based on linguistic insights and experimentally demonstrate that they lead to significant improvements over the current state-of-the-art in FrameNet-based semantic role labeling.

pdf
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Michael Roth | Ewan Klein
Proceedings of the 1st Workshop on Language and Ontologies

pdf bib
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics
Michael Roth | Annie Louis | Bonnie Webber | Tim Baldwin
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

pdf
Inducing Implicit Arguments from Comparable Texts: A Framework and Its Applications
Michael Roth | Anette Frank
Computational Linguistics, Volume 41, Issue 4 - December 2015

Distributional, corpus-based descriptions have frequently been applied to model aspects of word meaning. However, distributional models that use corpus data as their basis have one well-known disadvantage: even though the distributional features based on corpus co-occurrence were often successful in capturing meaning aspects of the words to be described, they generally fail to capture those meaning aspects that refer to world knowledge, because coherent texts tend not to provide redundant information that is presumably available knowledge. The question we ask in this paper is whether dictionary and encyclopaedic resources might complement the distributional information in corpus data, and provide world knowledge that is missing in corpora. As test case for meaning aspects, we rely on a collection of semantic associates to German verbs and nouns. Our results indicate that a combination of the knowledge resources should be helpful in work on distributional descriptions.