Udo Kruschwitz

Also published as: U. Kruschwitz


2023

Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
Juntao Yu | Silviu Paun | Maris Camilleri | Paloma Garcia | Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Although several datasets annotated for anaphoric reference / coreference exist, even the largest such datasets have limitations in terms of size, range of domains, coverage of anaphoric phenomena, and size of documents included. Yet, the approaches proposed to scale up anaphoric annotation haven’t so far resulted in datasets overcoming these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference due in part to substantial activity by the players, in part thanks to the use of a new resolve-and-aggregate paradigm to ‘complete’ markable annotations through the combination of an anaphoric resolver and an aggregation method for anaphoric reference. The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no comparable size datasets exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents ( 2K in length).

2022

MS@IW at SemEval-2022 Task 4: Patronising and Condescending Language Detection with Synthetically Generated Data
Selina Meyer | Maximilian Schmidhuber | Udo Kruschwitz
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In this description paper we outline the system architecture submitted to Task 4, Subtask 1 at SemEval-2022. We leverage the generative power of state of the art generative pretrained transformer models to increase training set size and remedy class imbalance issues. Our best submitted system is trained on a synthetically enhanced dataset with 10.3 times as many positive samples as the original dataset and reaches an F1 score of 50.62%, which is 10 percentage points higher than our initial system trained on an undersampled version of the original dataset. We explore possible reasons for the comparably low score in the overall task ranking and report on experiments conducted during the post-evaluation phase.

Applying Automatic Text Summarization for Fake News Detection
Philipp Hartl | Udo Kruschwitz
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The distribution of fake news is not a new but a rapidly growing problem. The shift to news consumption via social media has been one of the drivers for the spread of misleading and deliberately wrong information, as in addition to its ease of use there is rarely any veracity monitoring. Due to the harmful effects of such fake news on society, the detection of these has become increasingly important. We present an approach to the problem that combines the power of transformer-based language models while simultaneously addressing one of their inherent problems. Our framework, CMTR-BERT, combines multiple text representations, with the goal of circumventing sequential limits and related loss of information the underlying transformer architecture typically suffers from. Additionally, it enables the incorporation of contextual information. Extensive experiments on two very different, publicly available datasets demonstrate that our approach is able to set new state-of-the-art performance benchmarks. Apart from the benefit of using automatic text summarization techniques we also find that the incorporation of contextual information contributes to performance gains.

A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts
Miriam Schirmer | Udo Kruschwitz | Gregor Donabauer
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Recent progress in natural language processing has been impressive in many different areas with transformer-based approaches setting new benchmarks for a wide range of applications. This development has also lowered the barriers for people outside the NLP community to tap into the tools and resources applied to a variety of domain-specific applications. The bottleneck however still remains the lack of annotated gold-standard collections as soon as one’s research or professional interest falls outside the scope of what is readily available. One such area is genocide-related research (also including the work of experts who have a professional interest in accessing, exploring and searching large-scale document collections on the topic, such as lawyers). We present GTC (Genocide Transcript Corpus), the first annotated corpus of genocide-related court transcripts which serves three purposes: (1) to provide a first reference corpus for the community, (2) to establish benchmark performances (using state-of-the-art transformer-based approaches) for the new classification task of paragraph identification of violence-related witness statements, (3) to explore first steps towards transfer learning within the domain. We consider our contribution to be addressing in particular this year’s hot topic on Language Technology for All.

Tackling Irony Detection using Ensemble Classifiers
Christoph Turban | Udo Kruschwitz
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Automatic approaches to irony detection have been of interest to the NLP community for a long time, yet, state-of-the-art approaches still fall way short of what one would consider a desirable performance. In part this is due to the inherent difficulty of the problem. However, in recent years ensembles of transformer-based approaches have emerged as a promising direction to push the state of the art forward in a wide range of NLP applications. A different, more recent, development is the automatic augmentation of training data. In this paper we will explore both these directions for the task of irony detection in social media. Using the common SemEval 2018 Task 3 benchmark collection we demonstrate that transformer models are well suited in ensemble classifiers for the task at hand. In the multi-class classification task we observe statistically significant improvements over strong baselines. For binary classification we achieve performance that is on par with state-of-the-art alternatives. The examined data augmentation strategies showed an effect, but are not decisive for good results.

2021

ur-iw-hnt at GermEval 2021: An Ensembling Strategy with Multiple BERT Models
Hoai Nam Tran | Udo Kruschwitz
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments

This paper describes our approach (ur-iw-hnt) for the Shared Task of GermEval2021 to identify toxic, engaging, and fact-claiming comments. We submitted three runs using an ensembling strategy by majority (hard) voting with multiple different BERT models of three different types: German-based, Twitter-based, and multilingual models. All ensemble models outperform single models, while BERTweet is the winner of all individual models in every subtask. Twitter-based models perform better than GermanBERT models, and multilingual models perform worse but by a small margin.
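The majority (hard) voting mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `hard_vote`, the toy predictions, and the tie-breaking rule (the label seen first wins) are assumptions made for illustration only.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority label per item across several models.

    predictions: list of per-model label lists, all of the same length.
    Ties go to the label encountered first (an assumption of this sketch).
    """
    voted = []
    for labels in zip(*predictions):
        # most_common keeps first-seen order among labels with equal counts
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical outputs of three classifiers on four comments (1 = toxic)
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)  # [1, 1, 1, 0]
```

Using an odd number of models, as in this toy example, avoids ties for binary labels.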

UR@NLP_A_Team @ GermEval 2021: Ensemble-based Classification of Toxic, Engaging and Fact-Claiming Comments
Kwabena Odame Akomeah | Udo Kruschwitz | Bernd Ludwig
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments

In this paper, we report on our approach to addressing the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments for the German language. We submitted three runs for each subtask based on ensembles of three models each using contextual embeddings from pre-trained language models using SVM and neural-network-based classifiers. We include language-specific as well as language-agnostic language models – both with and without fine-tuning. We observe that, for the runs we submitted, the SVM models overfitted the training data and this affected the aggregation method (simple majority voting) of the ensembles. The model records a lower performance on the test set than on the training set. Exploring the issue of overfitting we uncovered that due to a bug in the pipeline the runs we submitted had not been trained on the full set but only on a small training set. Therefore in this paper we also include the results we obtain when training on the full training set, which demonstrate the power of ensembles.

2020

Speaking Outside the Box: Exploring the Benefits of Unconstrained Input in Crowdsourcing and Citizen Science Platforms
Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

Crowdsourcing approaches provide a difficult design challenge for developers. There is a trade-off between the efficiency of the task to be done and the reward given to the user for participating, whether it be altruism, social enhancement, entertainment or money. This paper explores how crowdsourcing and citizen science systems collect data and complete tasks, illustrated by a case study from the online language game-with-a-purpose Phrase Detectives. The game was originally developed to be a constrained interface to prevent player collusion, but subsequently benefited from posthoc analysis of over 76k unconstrained inputs from users. Understanding the interface design and task deconstruction are critical for enabling users to participate in such systems and the paper concludes with a discussion of the idea that social networks can be viewed as a form of citizen science platform with both constrained and unconstrained inputs making for a highly complex dataset.

2019

A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation
Massimo Poesio | Jon Chamberlain | Silviu Paun | Juntao Yu | Alexandra Uma | Udo Kruschwitz
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a corpus of anaphoric information (coreference) crowdsourced through a game-with-a-purpose. The corpus, containing annotations for about 108,000 markables, is one of the largest corpora for coreference for English, and one of the largest crowdsourced NLP corpora, but its main feature is the large number of judgments per markable: 20 on average, and over 2.2M in total. This characteristic makes the corpus a unique resource for the study of disagreements on anaphoric interpretation. A second distinctive feature is its rich annotation scheme, covering singletons, expletives, and split-antecedent plurals. Finally, the corpus also comes with labels inferred using a recently proposed probabilistic model of annotation for coreference. The labels are of high quality and make it possible to successfully train a state of the art coreference resolver, including training on singletons and non-referring expressions. The annotation model can also result in more than one label, or no label, being proposed for a markable, thus serving as a baseline method for automatically identifying ambiguous markables. A preliminary analysis of the results is presented.

Crowdsourcing and Aggregating Nested Markable Annotations
Chris Madge | Juntao Yu | Jon Chamberlain | Udo Kruschwitz | Silviu Paun | Massimo Poesio
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

One of the key steps in language resource creation is the identification of the text segments to be annotated, or markables, which depending on the task may vary from nominal chunks for named entity resolution to (potentially nested) noun phrases in coreference resolution (or mentions) to larger text segments in text segmentation. Markable identification is typically carried out semi-automatically, by running a markable identifier and correcting its output by hand–which is increasingly done via annotators recruited through crowdsourcing and aggregating their responses. In this paper, we present a method for identifying markables for coreference annotation that combines high-performance automatic markable detectors with checking with a Game-With-A-Purpose (GWAP) and aggregation using a Bayesian annotation model. The method was evaluated both on news data and data from a variety of other genres and results in an improvement on F1 of mention boundaries of over seven percentage points when compared with a state-of-the-art, domain-independent automatic mention detector, and almost three points over an in-domain mention detector. One of the key contributions of our proposal is its applicability to the case in which markables are nested, as is the case with coreference markables; but the GWAP and several of the proposed markable detectors are task and language-independent and are thus applicable to a variety of other annotation scenarios.

2018

Improving Hate Speech Detection with Deep Learning Ensembles
Steven Zimmerman | Udo Kruschwitz | Chris Fox
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Scalable Visualisation of Sentiment and Stance
Jon Chamberlain | Udo Kruschwitz | Orland Hoeber
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Comparing Bayesian Models of Annotation
Silviu Paun | Bob Carpenter | Jon Chamberlain | Dirk Hovy | Udo Kruschwitz | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 6

The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item difficulties and error patterns. Traditionally, majority voting was used for 1, and coefficients of agreement for 2 and 3. Lately, model-based analyses of corpus annotations have proven better at all three tasks. But there has been relatively little work comparing them on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.

A Probabilistic Annotation Model for Crowdsourcing Coreference
Silviu Paun | Jon Chamberlain | Udo Kruschwitz | Juntao Yu | Massimo Poesio
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The availability of large scale annotated corpora for coreference is essential to the development of the field. However, creating resources at the required scale via expert annotation would be too expensive. Crowdsourcing has been proposed as an alternative; but this approach has not been widely used for coreference. This paper addresses one crucial hurdle on the way to make this possible, by introducing a new model of annotation for aggregating crowdsourced anaphoric annotations. The model is evaluated along three dimensions: the accuracy of the inferred mention pairs, the quality of the post-hoc constructed silver chains, and the viability of using the silver chains as an alternative to the expert-annotated chains in training a state of the art coreference system. The results suggest that our model can extract from crowdsourced annotations coreference chains of comparable quality to those obtained with expert annotation.

2016

The OnForumS corpus from the Shared Task on Online Forum Summarisation at MultiLing 2015
Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio | Josef Steinberger | Jorge Valderrama | Hugo Zaragoza
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present the OnForumS corpus developed for the shared task of the same name on Online Forum Summarisation (OnForumS at MultiLing’15). The corpus consists of a set of news articles with associated readers’ comments from The Guardian (English) and La Repubblica (Italian). It comes with four levels of annotation: argument structure, comment-article linking, sentiment and coreference. The first three were produced through crowdsourcing, whereas the last was produced by an experienced annotator using a mature annotation scheme. Given its annotation breadth, we believe the corpus will prove a useful resource in stimulating and furthering research in the areas of Argumentation Mining, Summarisation, Sentiment, Coreference and the interlinks therein.

Towards a Corpus of Violence Acts in Arabic Social Media
Ayman Alhelbawy | Massimo Poesio | Udo Kruschwitz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a new corpus of Arabic tweets that mention some form of violent event, developed to support the automatic identification of Human Rights Abuse. The dataset was manually labelled for seven classes of violence using crowdsourcing.

Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference.
Jon Chamberlain | Massimo Poesio | Udo Kruschwitz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Natural Language Engineering tasks require large and complex annotated datasets to build more advanced models of language. Corpora are typically annotated by several experts to create a gold standard; however, there are now compelling reasons to use a non-expert crowd to annotate text, driven by cost, speed and scalability. Phrase Detectives Corpus 1.0 is an anaphorically-annotated corpus of encyclopedic and narrative text that contains a gold standard created by multiple experts, as well as a set of annotations created by a large non-expert crowd. Analysis shows very good inter-expert agreement (kappa=.88-.93) but a more variable baseline crowd agreement (kappa=.52-.96). Encyclopedic texts show less agreement (and by implication are harder to annotate) than narrative texts. The release of this corpus is intended to encourage research into the use of crowds for text annotation and the development of more advanced, probabilistic language models, in particular for anaphoric coreference.
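The agreement figures in this abstract are kappa coefficients. The abstract does not say which variant was used, so the sketch below assumes Cohen's kappa for two annotators, purely as an illustration of how such a coefficient corrects observed agreement for chance. The function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on four markables
annotator_1 = ["new", "new", "old", "old"]
annotator_2 = ["new", "old", "old", "old"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.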

2015

MultiLing 2015: Multilingual Summarization of Single and Multi-Documents, On-line Fora, and Call-center Conversations
George Giannakopoulos | Jeff Kubina | John Conroy | Josef Steinberger | Benoit Favre | Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Combining Minimally-supervised Methods for Arabic Named Entity Recognition
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 3

Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semi-supervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised methods tend to have very high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. This complementarity suggests that better results may be obtained by combining the two types of minimally supervised methods. In this paper we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We trained a semi-supervised NER classifier and another one using distant learning techniques, and then combined them using a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best base classifiers.

2014

Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

AraNLP: a Java-based Library for the Processing of Arabic Text.
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a free, Java-based library named “AraNLP” that covers various Arabic text preprocessing tools. Although a good number of tools for processing Arabic text already exist, integration and compatibility problems continually occur. AraNLP is an attempt to gather most of the vital Arabic text preprocessing tools into one library that can be accessed easily by integrating or accurately adapting existing tools and by developing new ones when required. The library includes a sentence detector, tokenizer, light stemmer, root stemmer, part-of-speech tagger (POS-tagger), word segmenter, normalizer, and a punctuation and diacritic remover.

2013

A Semi-supervised Learning Approach to Arabic Named Entity Recognition
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Assessing Crowdsourcing Quality through Objective Tasks
Ahmet Aker | Mahmoud El-Haj | M-Dyaa Albakour | Udo Kruschwitz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The emergence of crowdsourcing as a commonly used approach to collect vast quantities of human assessments on a variety of tasks represents nothing less than a paradigm shift. This is particularly true in academic research where it has suddenly become possible to collect (high-quality) annotations rapidly without the need of an expert. In this paper we investigate factors which can influence the quality of the results obtained through Amazon's Mechanical Turk crowdsourcing platform. We investigated the impact of different presentation methods (free text versus radio buttons), workers' base (USA versus India as the main bases of MTurk workers) and payment scale (about $4, $8 and $10 per hour) on the quality of the results. For each run we assessed the results provided by 25 workers on a set of 10 tasks. We ran two different experiments using objective tasks: maths and general text questions. In both tasks the answers are unique, which eliminates the uncertainty usually present in subjective tasks, where it is not clear whether the unexpected answer is caused by a lack of worker motivation, the worker's interpretation of the task or genuine ambiguity. In this work we present our results comparing the influence of the different factors used. One of the interesting findings is that our results do not confirm previous studies which concluded that an increase in payment attracts more noise. We also find that the country of origin only has an impact in some of the categories and only in general text questions but there is no significant difference at the top pay.

Applying Random Indexing to Structured Data to Find Contextually Similar Words
Danica Damljanović | Udo Kruschwitz | M-Dyaa Albakour | Johann Petrak | Mihai Lupu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Language resources extracted from structured data (e.g. Linked Open Data) have already been used in various scenarios to improve conventional Natural Language Processing techniques. The meanings of words and the relations between them are made more explicit in RDF graphs, in comparison to human-readable text, and hence have a great potential to improve legacy applications. In this paper, we describe an approach that can be used to extend or clarify the semantic meaning of a word by constructing a list of contextually related terms. Our approach is based on exploiting the structure inherent in an RDF graph and then applying the methods from statistical semantics, and in particular, Random Indexing, in order to discover contextually related terms. We evaluate our approach in the domain of life science using the dataset generated with the help of domain experts from a large pharmaceutical company (AstraZeneca). They were involved in two phases: firstly, to generate a set of keywords of interest to them, and secondly to judge the set of generated contextually similar words for each keyword of interest. We compare our proposed approach, exploiting the semantic graph, with the same method applied on the human readable text extracted from the graph.
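Random Indexing, the statistical-semantics method named in this abstract, can be sketched in a few lines. This is a generic illustration, not the paper's pipeline: how contexts are derived from the RDF graph is not described here, so the context strings, the terms, and the parameters `dim`, `nonzeros` and `seed` are all hypothetical.

```python
import math
import random

def index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec

def random_indexing(term_contexts, dim=64, nonzeros=4, seed=0):
    """Sum the random index vectors of every context a term occurs in."""
    rng = random.Random(seed)
    context_index = {}  # one fixed random vector per distinct context
    term_vectors = {}
    for term, contexts in term_contexts.items():
        vec = [0] * dim
        for ctx in contexts:
            if ctx not in context_index:
                context_index[ctx] = index_vector(dim, nonzeros, rng)
            vec = [v + c for v, c in zip(vec, context_index[ctx])]
        term_vectors[term] = vec
    return term_vectors

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical terms, each with graph-derived contexts (illustrative only)
vectors = random_indexing({
    "aspirin": ["ex:treats/pain"],
    "ibuprofen": ["ex:treats/pain"],
    "insulin": ["ex:treats/diabetes"],
})
same_context = cosine(vectors["aspirin"], vectors["ibuprofen"])
```

Terms that share all their contexts end up with identical vectors, so `same_context` is 1.0. Terms with disjoint contexts get nearly orthogonal vectors, because sparse random index vectors rarely overlap.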

Finding the Right Supervisor: Expert-Finding in a University Domain
Fawaz Alarfaj | Udo Kruschwitz | David Hunter | Chris Fox
Proceedings of the NAACL HLT 2012 Student Research Workshop

2009

Constructing an Anaphorically Annotated Corpus with Non-Experts: Assessing the Quality of Collaborative Annotations
Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

2008

ANAWIKI: Creating Anaphorically Annotated Resources through Web Cooperation
Massimo Poesio | Udo Kruschwitz | Jon Chamberlain
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora, but creating such corpora by hand annotation is very expensive and time consuming; in practice, it is unfeasible to think of annotating more than one million words. However, the success of Wikipedia and other projects shows that another approach might be possible: take advantage of the willingness of Web users to contribute to collaborative resource creation. AnaWiki is a recently started project that will develop tools to allow and encourage large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (in the first instance, of a corpus annotated with information about anaphora).

Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts
Jon Chamberlain | Massimo Poesio | Udo Kruschwitz
Semantics in Text Processing. STEP 2008 Conference Proceedings

2006

An Anaphora Resolution-Based Anonymization Module
M. Poesio | M. A. Kabadjov | P. Goux | U. Kruschwitz | E. Bishop | L. Corti
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Growing privacy and security concerns mean there is an increasing need for data to be anonymized before being publicly released. We present a module for anonymizing references implemented as part of the SQUAD tools for specifying and testing non-proprietary means of storing and marking-up data using universal (XML) standards and technologies. The tool is implemented on top of the GUITAR anaphoric resolver.