Merel Scholman


Label distributions help implicit discourse relation classification
Frances Yung | Kaveri Anuranjana | Merel Scholman | Vera Demberg
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

Implicit discourse relations can convey more than one relation sense, but much of the research on discourse relations has focused on single relation senses. Recently, DiscoGeM, a novel multi-domain corpus, which contains 10 crowd-sourced labels per relational instance, has become available. In this paper, we analyse the co-occurrences of relations in DiscoGeM and show that they are systematic and characteristic of text genre. We then test whether information on multi-label distributions in the data can help implicit relation classifiers. Our results show that incorporating multiple labels in parser training can improve its performance, and yield label distributions which are more similar to human label distributions, compared to a parser that is trained on just a single most frequent label per instance.
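
One way to picture the difference between training on a label distribution and training on a single most frequent label is through the loss target. The sketch below is only an illustration, not the parser or training setup from the paper: the sense names, the probabilities, and the `cross_entropy` helper are all assumptions made for the example.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(
        t * math.log(max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0  # zero-probability targets contribute nothing
    )

# Hypothetical relation senses and values, for illustration only.
senses = ["cause", "conjunction", "contrast"]
soft_target = [0.5, 0.4, 0.1]   # distribution over crowd labels
hard_target = [1.0, 0.0, 0.0]   # single most frequent label
predicted = [0.6, 0.3, 0.1]     # a classifier's output distribution

soft_loss = cross_entropy(soft_target, predicted)
hard_loss = cross_entropy(hard_target, predicted)
```

With the hard target, only the probability assigned to the majority sense affects the loss. With the soft target, the classifier is also penalised for misjudging how much probability the minority senses should receive.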

Design Choices in Crowdsourcing Discourse Relation Annotations: The Effect of Worker Selection and Training
Merel Scholman | Valentina Pyatkin | Frances Yung | Ido Dagan | Reut Tsarfaty | Vera Demberg
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Obtaining linguistic annotation from novice crowdworkers is far from trivial. A case in point is the annotation of discourse relations, which is a complicated task. Recent methods have obtained promising results by extracting relation labels from either discourse connectives (DCs) or question-answer (QA) pairs that participants provide. The current contribution studies the effect of worker selection and training on the agreement on implicit relation labels between workers and gold labels, for both the DC and the QA method. In Study 1, workers were not specifically selected or trained, and the results show that there is much room for improvement. Study 2 shows that a combination of selection and training does lead to improved results, but the method is cost- and time-intensive. Study 3 shows that a selection-only approach is a viable alternative; it results in annotations of comparable quality compared to annotations from trained participants. The results generalized over both the DC and QA method and therefore indicate that a selection-only approach could also be effective for other crowdsourced discourse annotation tasks.

DiscoGeM: A Crowdsourced Corpus of Genre-Mixed Implicit Discourse Relations
Merel Scholman | Tianai Dong | Frances Yung | Vera Demberg
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present DiscoGeM, a crowdsourced corpus of 6,505 implicit discourse relations from three genres: political speech, literature, and encyclopedic texts. Each instance was annotated by 10 crowd workers. Various label aggregation methods were explored to evaluate how to obtain a label that best captures the meaning inferred by the crowd annotators. The results show that a significant proportion of discourse relations in DiscoGeM are ambiguous and can express multiple relation senses. Probability distribution labels better capture these interpretations than single labels. Further, the results emphasize that text genre crucially affects the distribution of discourse relations, suggesting that genre should be included as a factor in automatic relation classification. We make available the newly created DiscoGeM corpus, as well as the dataset with all annotator-level labels. Both the corpus and the dataset can facilitate a multitude of applications and research purposes, for example to function as training data to improve the performance of automatic discourse relation parsers, as well as facilitate research into non-connective signals of discourse relations.
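
The contrast between a probability distribution label and a single aggregated label can be sketched in a few lines. This is a minimal illustration assuming simple relative-frequency aggregation; the abstract mentions that various aggregation methods were explored, and this sketch does not claim to be any of them. The function names and the example votes are hypothetical.

```python
from collections import Counter

def label_distribution(annotations):
    """Turn a list of crowd labels into a relative-frequency distribution."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def majority_label(annotations):
    """Single-label aggregation: the most frequent label.

    Ties are resolved in favour of the label seen first.
    """
    return Counter(annotations).most_common(1)[0][0]

# Hypothetical item annotated by 10 crowd workers.
votes = ["result"] * 5 + ["conjunction"] * 4 + ["contrast"]
dist = label_distribution(votes)
single = majority_label(votes)
```

Here the single label keeps only "result", while the distribution also records that four of the ten workers inferred "conjunction".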

Establishing Annotation Quality in Multi-label Annotations
Marian Marchal | Merel Scholman | Frances Yung | Vera Demberg
Proceedings of the 29th International Conference on Computational Linguistics

In many linguistic fields requiring annotated data, multiple interpretations of a single item are possible. Multi-label annotations more accurately reflect this possibility. However, allowing for multi-label annotations also affects the chance that two coders agree with each other. Calculating inter-coder agreement for multi-label datasets is therefore not trivial. In the current contribution, we evaluate different metrics for calculating agreement on multi-label annotations: agreement on the intersection of annotated labels, an augmented version of Cohen’s Kappa, and precision, recall and F1. We propose a bootstrapping method to obtain chance agreement for each measure, which allows us to obtain an adjusted agreement coefficient that is more interpretable. We demonstrate how various measures affect estimates of agreement on simulated datasets and present a case study of discourse relation annotations. We also show how the proportion of double labels and the entropy of the label distribution influence the measures outlined above, and how a bootstrapped adjusted agreement can make agreement measures more comparable across datasets in multi-label scenarios.
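
The general idea of a chance-adjusted agreement score for multi-label annotations can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure from the paper: it assumes that agreement means overlapping label sets, that chance agreement is estimated by independently resampling each coder's annotations with replacement, and that the adjustment takes the usual (observed - chance) / (1 - chance) form. All function names and the example data are hypothetical.

```python
import random

def intersection_agreement(coder_a, coder_b):
    """Share of items on which the two coders' label sets overlap."""
    hits = sum(1 for a, b in zip(coder_a, coder_b) if a & b)
    return hits / len(coder_a)

def bootstrapped_chance(coder_a, coder_b, n_samples=1000, seed=0):
    """Estimate chance agreement by pairing randomly resampled annotations."""
    rng = random.Random(seed)
    n = len(coder_a)
    total = 0.0
    for _ in range(n_samples):
        # Resampling each coder independently breaks the link between items.
        sample_a = [rng.choice(coder_a) for _ in range(n)]
        sample_b = [rng.choice(coder_b) for _ in range(n)]
        total += intersection_agreement(sample_a, sample_b)
    return total / n_samples

def adjusted_agreement(coder_a, coder_b, n_samples=1000, seed=0):
    """Chance-adjusted agreement: (observed - chance) / (1 - chance)."""
    observed = intersection_agreement(coder_a, coder_b)
    chance = bootstrapped_chance(coder_a, coder_b, n_samples, seed)
    if chance == 1.0:
        return 0.0  # agreement is guaranteed by chance; nothing to adjust
    return (observed - chance) / (1.0 - chance)

# Hypothetical annotations: each item is a set of one or two labels.
coder_a = [{"cause"}, {"contrast"}, {"cause", "conjunction"}, {"conjunction"}]
coder_b = [{"cause"}, {"contrast"}, {"cause", "conjunction"}, {"conjunction"}]
observed = intersection_agreement(coder_a, coder_b)
chance = bootstrapped_chance(coder_a, coder_b)
adjusted = adjusted_agreement(coder_a, coder_b)
```

Because items with two labels overlap with more of the other coder's annotations, a higher proportion of double labels raises the chance estimate, which is why an unadjusted overlap score is hard to compare across datasets.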


A practical perspective on connective generation
Frances Yung | Merel Scholman | Vera Demberg
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

In data-driven natural language generation, we typically know what relation should be expressed and need to select a connective to lexicalize it. In the current contribution, we analyse whether a sophisticated connective generation module is necessary to select a connective, or whether this can be solved with simple methods (such as random choice between connectives that are known to express a given relation, or usage of a generic language model). Comparing these methods to the distributions of connective choices from a human connective insertion task, we find mixed results: for some relations, it is acceptable to lexicalize them using any of the connectives that mark this relation. However, for other relations (temporals, concessives) either a more detailed relation distinction needs to be introduced, or a more sophisticated connective choice module would be necessary.

Semi-automatic discourse annotation in a low-resource language: Developing a connective lexicon for Nigerian Pidgin
Marian Marchal | Merel Scholman | Vera Demberg
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

Cross-linguistic research on discourse structure and coherence marking requires discourse-annotated corpora and connective lexicons in a large number of languages. However, the availability of such resources is limited, especially for languages for which linguistic resources are scarce in general, such as Nigerian Pidgin. In this study, we demonstrate how a semi-automatic approach can be used to source connectives and their relation senses and develop a discourse-annotated corpus in a low-resource language. Connectives and their relation senses were extracted from a parallel corpus combining automatic (PDTB end-to-end parser) and manual annotations. This resulted in Naija-Lex, a lexicon of discourse connectives in Nigerian Pidgin with English translations. The lexicon shows that the majority of Nigerian Pidgin connectives are borrowed from its English lexifier, but that there are also some connectives that are unique to Nigerian Pidgin.

Comparison of methods for explicit discourse connective identification across various domains
Merel Scholman | Tianai Dong | Frances Yung | Vera Demberg
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

Existing parse methods use varying approaches to identify explicit discourse connectives, but their performance has not been consistently evaluated in comparison to each other, nor have they been evaluated consistently on text other than newspaper articles. We here assess the performance on explicit connective identification of three parse methods (PDTB e2e, Lin et al., 2014; the winner of CoNLL 2015, Wang et al., 2015; and DisSent, Nie et al., 2019), along with a simple heuristic. We also examine how well these systems generalize to different datasets, namely written newspaper text (PDTB), written scientific text (BioDRB), prepared spoken text (TED-MDB) and spontaneous spoken text (Disco-SPICE). The results show that the e2e parser outperforms the other parse methods in all datasets. However, performance drops significantly from the PDTB to all other datasets. We provide a more fine-grained analysis of domain differences and connectives that prove difficult to parse, in order to highlight the areas where gains can be made.


Crowdsourcing Discourse Relation Annotations by a Two-Step Connective Insertion Task
Frances Yung | Vera Demberg | Merel Scholman
Proceedings of the 13th Linguistic Annotation Workshop

The prospect of being able to crowd-source coherence relations bears the promise of acquiring annotations for new texts quickly, which could then increase the size and variety of discourse-annotated corpora. It would also open the avenue to answering new research questions: collecting annotations from a larger number of individuals per instance would make it possible to investigate the distribution of inferred relations and to study individual differences in coherence relation interpretation. However, annotating coherence relations with untrained workers is not trivial. We here propose a novel two-step annotation procedure, which extends an earlier method by Scholman and Demberg (2017a). In our approach, coherence relation labels are inferred from connectives that workers insert into the text. We show that the proposed method leads to replicable coherence annotations, and analyse the agreement between the obtained relation labels and annotations from PDTB and RSTDT on the same texts.


Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task
Merel Scholman | Vera Demberg
Proceedings of the 11th Linguistic Annotation Workshop

Traditional discourse annotation tasks are considered costly and time-consuming, and the reliability and validity of these tasks are in question. In this paper, we investigate whether crowdsourcing can be used to obtain reliable discourse relation annotations. We also examine the influence of context on the reliability of the data. The results of a crowdsourced connective insertion task showed that the method can be used to obtain reliable annotations: the majority of the inserted connectives converged with the original label. Further, the method is sensitive to the fact that multiple senses can often be inferred for a single relation. Regarding the presence of context, the results show no significant difference in distributions of insertions between conditions overall. However, a by-item comparison revealed several characteristics of segments that determine whether the presence of context makes a difference in annotations. The findings discussed in this paper can be taken as evidence that crowdsourcing can be used as a valuable method to obtain insights into the sense(s) of relations.

Evaluating discourse annotation: Some recent insights and new approaches
Jet Hoek | Merel Scholman
Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (ISA-13)


Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks
Ines Rehbein | Merel Scholman | Vera Demberg
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In discourse relation annotation, there is currently a variety of different frameworks being used, and most of them have been developed and employed mostly on written data. This raises a number of questions regarding interoperability of discourse relation annotation schemes, as well as regarding differences in discourse annotation for written vs. spoken domains. In this paper, we describe our experiments on annotating two spoken domains from the SPICE Ireland corpus (telephone conversations and broadcast interviews) according to two different discourse annotation schemes, PDTB 3.0 and CCR. We show that annotations in the two schemes can largely be mapped onto one another, and discuss differences in operationalisations of discourse relation schemes which present a challenge to automatic mapping. We also observe systematic differences in the prevalence of implicit discourse relations in spoken data compared to written texts, and find that there are also differences in the types of causal relations between the domains. Finally, we find that PDTB 3.0 addresses many shortcomings of PDTB 2.0 with respect to the annotation of spoken discourse, and suggest further extensions. The new corpus has roughly the size of the CoNLL 2015 Shared Task test set, and we hence hope that it will be a valuable resource for the evaluation of automatic discourse relation labellers.