Abstract
The target outputs of many NLP tasks are word sequences. For collecting the data to train and evaluate models, the crowd is cheaper and easier to access than an oracle. To ensure the quality of crowdsourced data, one can assign multiple workers to each question and then aggregate their answers, which vary in quality, into a single golden answer. How to aggregate multiple crowdsourced word sequences of varying quality is an interesting and challenging problem, and addressing it requires a dataset. We therefore create a dataset (CrowdWSA2019) that contains translated sentences generated by multiple workers. We provide three approaches as baselines for the task of extractive word sequence aggregation. In particular, one of them is an original approach we propose that models the reliability of workers. We also discuss some issues in ground truth creation for word sequences that can be addressed using this dataset.
- Anthology ID:
- D19-5904
- Volume:
- Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong
- Editors:
- Silviu Paun, Dirk Hovy
- Venue:
- WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 24–28
- URL:
- https://aclanthology.org/D19-5904
- DOI:
- 10.18653/v1/D19-5904
- Cite (ACL):
- Jiyi Li and Fumiyo Fukumoto. 2019. A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation. In Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP, pages 24–28, Hong Kong. Association for Computational Linguistics.
- Cite (Informal):
- A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation (Li & Fukumoto, 2019)
- PDF:
- https://preview.aclanthology.org/naacl24-info/D19-5904.pdf