Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP

Silviu Paun, Dirk Hovy (Editors)


Anthology ID: D19-59
Month: November
Year: 2019
Address: Hong Kong
Venue: WS
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/D19-59
PDF: https://preview.aclanthology.org/emnlp-22-attachments/D19-59.pdf

Dependency Tree Annotation with Mechanical Turk
Stephen Tratz

Crowdsourcing is frequently employed to obtain valuable linguistic annotations quickly and inexpensively, but it is rarely used for parsing, likely due to the perceived difficulty of the task and the limited training of the available workers. This paper presents what is, to the best of our knowledge, the first published use of Mechanical Turk (or a similar platform) to crowdsource parse trees. We pay Turkers to construct unlabeled dependency trees for 500 English sentences using an interactive graphical dependency tree editor, collecting 10 annotations per sentence. Despite receiving no training, several of the more prolific workers meet or exceed 90% attachment agreement with the Penn Treebank (PTB) portion of our data, and for 72% of these PTB sentences, at least one Turker produces a perfect parse. Thus, we find that, supported by a simple graphical interface, people with presumably no prior experience can achieve surprisingly high accuracy on this task. To facilitate research into aggregation techniques for complex crowdsourced annotations, we publicly release our annotated corpus.
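
As an illustration of the attachment-agreement measure referred to above, the following minimal Python sketch compares a worker's unlabeled dependency tree to a reference tree; the head-index representation is an assumption made for this example, not the paper's data format.

    # Minimal sketch (not from the paper): unlabeled attachment agreement
    # between a worker's tree and a reference tree, each given as a list of
    # head indices (one per token, 0 denoting the root).
    def attachment_agreement(worker_heads, reference_heads):
        """Fraction of tokens whose head matches the reference parse."""
        assert len(worker_heads) == len(reference_heads)
        matches = sum(w == r for w, r in zip(worker_heads, reference_heads))
        return matches / len(reference_heads)

    # Example: a five-token sentence in which the worker misattaches one token.
    reference = [2, 0, 2, 5, 3]
    worker = [2, 0, 2, 5, 2]
    print(attachment_agreement(worker, reference))  # 0.8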

Word Familiarity Rate Estimation Using a Bayesian Linear Mixed Model
Masayuki Asahara

This paper presents research on word familiarity rate estimation using the ‘Word List by Semantic Principles’. We collected rating information on 96,557 words from the ‘Word List by Semantic Principles’ via Yahoo! crowdsourcing. We asked 3,392 participants to introspectively rate the familiarity of words from five perspectives: ‘KNOW’, ‘WRITE’, ‘READ’, ‘SPEAK’, and ‘LISTEN’; each word was rated by at least 16 participants. We used Bayesian linear mixed models to estimate the word familiarity rates. We also explored how the ratings relate to the semantic labels used in the ‘Word List by Semantic Principles’.
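
As a rough illustration of the modelling idea (not the paper's actual Bayesian model), the sketch below fits a frequentist linear mixed model with statsmodels, treating each word as a group with a random intercept; the toy data, column names, and single ‘KNOW’ perspective are assumptions made for this example.

    # Minimal sketch: per-word familiarity estimated as a random intercept in
    # a linear mixed model. The paper uses a Bayesian linear mixed model; here
    # statsmodels' (frequentist) MixedLM stands in for illustration only.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "word":   ["apple"] * 3 + ["quark"] * 3 + ["river"] * 3 + ["zeugma"] * 3,
        "rating": [5, 5, 4,       2, 1, 3,        5, 4, 5,        1, 2, 1],
    })

    model = smf.mixedlm("rating ~ 1", df, groups=df["word"])
    fit = model.fit()

    # Estimated familiarity per word = overall intercept + word-level effect.
    intercept = fit.fe_params["Intercept"]
    for word, effect in fit.random_effects.items():
        print(word, intercept + effect.iloc[0])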

Leveraging syntactic parsing to improve event annotation matching
Camiel Colruyt | Orphée De Clercq | Véronique Hoste

Detecting event mentions is the first step in event extraction from text, and annotating them is a notoriously difficult task. Evaluating annotator consistency is crucial when building datasets for mention detection. When event mentions are allowed to cover many tokens, annotators may disagree on their span, which means that overlapping annotations may refer to the same event or to different events. This paper explores different fuzzy-matching functions that aim to resolve this ambiguity. The functions extract the sets of syntactic heads present in the annotations, use the Dice coefficient to measure the similarity between sets, and return a judgment based on a given threshold. The functions are tested against the judgments of a human evaluator, and sets of tokens are compared with sets of syntactic heads. The best-performing function is head-based and agrees with the human evaluator in 89% of cases.
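
A minimal sketch of the head-based matching idea follows; the example head sets, the threshold value, and the helper names are assumptions made for this illustration, not taken from the paper.

    # Two overlapping event-mention annotations are compared via the Dice
    # coefficient of their sets of syntactic heads and judged to refer to the
    # same event if the score reaches a threshold. How the heads are extracted
    # (e.g. with a dependency parser) is omitted here.
    def dice(a, b):
        """Dice coefficient between two sets."""
        if not a and not b:
            return 1.0
        return 2 * len(a & b) / (len(a) + len(b))

    def same_event(heads_a, heads_b, threshold=0.8):
        # The 0.8 threshold is an assumed value; the paper leaves it as a parameter.
        return dice(heads_a, heads_b) >= threshold

    ann1 = {"attack", "killed"}
    ann2 = {"attack"}
    print(dice(ann1, ann2))        # ~0.67
    print(same_event(ann1, ann2))  # False at this threshold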

A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation
Jiyi Li | Fumiyo Fukumoto

The target outputs of many NLP tasks are word sequences. To collect data for training and evaluating models, the crowd is cheaper and easier to access than an oracle. To ensure the quality of crowdsourced data, one can assign multiple workers to each question and then aggregate their answers, which vary in quality, into a gold-standard answer. How to aggregate multiple crowdsourced word sequences of diverse quality is an interesting and challenging problem, and a dataset is needed to address it. We therefore create a dataset (CrowdWSA2019) that contains translated sentences produced by multiple workers. We provide three approaches as baselines for the task of extractive word sequence aggregation; in particular, one of them is an original approach we propose that models worker reliability. We also discuss some issues in ground truth creation for word sequences that can be addressed with this dataset.
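
One plausible extractive aggregation baseline (not necessarily one of the paper's three) is sketched below: among the workers' answers to a question, select the sequence that is, on average, most similar to the others. Word-overlap Dice is used as the similarity for simplicity, and the example answers are invented.

    # Minimal sketch of extractive word-sequence aggregation: select the
    # worker answer closest, on average, to all other answers for the same
    # question. The similarity measure and example answers are illustrative.
    def dice(a, b):
        a, b = set(a), set(b)
        if not a and not b:
            return 1.0
        return 2 * len(a & b) / (len(a) + len(b))

    def aggregate(sequences):
        """Return the candidate sequence with the highest mean similarity to the rest."""
        def avg_sim(i):
            others = [j for j in range(len(sequences)) if j != i]
            return sum(dice(sequences[i].split(), sequences[j].split())
                       for j in others) / len(others)
        best = max(range(len(sequences)), key=avg_sim)
        return sequences[best]

    answers = [
        "the cat sat on the mat",
        "a cat sat on the mat",
        "the cat is sitting on a mat",
    ]
    print(aggregate(answers))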

Crowd-sourcing annotation of complex NLU tasks: A case study of argumentative content annotation
Tamar Lavee | Lili Kotlerman | Matan Orbach | Yonatan Bilu | Michal Jacovi | Ranit Aharonov | Noam Slonim

Recent advances in machine reading and listening comprehension involve the annotation of long texts. Such tasks are typically time-consuming, making crowd annotation an attractive solution, yet their complexity often makes such a solution infeasible. In particular, a major concern is that crowd annotators may be tempted to skim through long texts and answer questions without reading thoroughly. We present a case study of adapting this type of task to the crowd. The task is to identify claims in a debate speech several minutes long. We show that sentence-by-sentence annotation does not scale and that labeling only a subset of sentences is insufficient. Instead, we propose a scheme for effectively performing the full, complex task with crowd annotators, allowing the collection of large-scale annotated datasets. We believe that the challenges and pitfalls we encountered, as well as the lessons learned, are generally relevant when collecting data for large-scale natural language understanding (NLU) tasks.

Computer Assisted Annotation of Tension Development in TED Talks through Crowdsourcing
Seungwon Yoon | Wonsuk Yang | Jong Park

We propose a method of machine-assisted annotation for identifying tension development, annotating whether the tension is increasing, decreasing, or staying unchanged. We use a neural-network-based prediction model whose predictions are presented to the annotators as initial values for the options they are asked to choose from. By presenting such initial values, the annotation task becomes an evaluation task in which the annotators inspect whether or not the predicted results are correct. To demonstrate the effectiveness of our method, we performed the annotation task in both in-house and crowdsourced environments. For the crowdsourced environment, we compared the annotation results with and without our method of machine-assisted annotation. We find that the results with our method show higher agreement with the gold standard than those without, though our method had little effect on reducing annotation time. Our code for the experiments is publicly available.

CoSSAT: Code-Switched Speech Annotation Tool
Sanket Shah | Pratik Joshi | Sebastin Santy | Sunayana Sitaram

Code-switching refers to the alternation of two or more languages within a conversation or utterance and is common in multilingual communities across the world. Building code-switched speech and natural language processing systems is challenging due to the lack of annotated speech and text data. We present CoSSAT, a speech annotation interface that helps annotators transcribe code-switched speech faster, more easily, and more accurately than a traditional interface by displaying candidate words from monolingual speech recognizers. We conduct a user study with 10 annotators on the transcription of Hindi-English code-switched speech and describe quantitative and qualitative results.