Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference.

Jon Chamberlain; Massimo Poesio; Udo Kruschwitz

Abstract

Natural Language Engineering tasks require large and complex annotated datasets to build more advanced models of language. Corpora are typically annotated by several experts to create a gold standard; however, there are now compelling reasons to use a non-expert crowd to annotate text, driven by cost, speed and scalability. Phrase Detectives Corpus 1.0 is an anaphorically-annotated corpus of encyclopedic and narrative text that contains a gold standard created by multiple experts, as well as a set of annotations created by a large non-expert crowd. Analysis shows very good inter-expert agreement (kappa=.88-.93) but a more variable baseline crowd agreement (kappa=.52-.96). Encyclopedic texts show less agreement (and by implication are harder to annotate) than narrative texts. The release of this corpus is intended to encourage research into the use of crowds for text annotation and the development of more advanced, probabilistic language models, in particular for anaphoric coreference.
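The agreement figures above are kappa scores, which correct raw agreement for the agreement expected by chance. As a rough illustration only, the sketch below computes Cohen's kappa for two annotators. The abstract does not say which kappa variant the authors used, so treating it as Cohen's kappa is an assumption; the function name `cohen_kappa` and the labels `"new"` and `"old"` are hypothetical and not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: a two-annotator kappa is used here for illustration;
    the paper may report a different variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions,
    # summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four markables, not drawn from the corpus.
annotator_1 = ["new", "old", "old", "new"]
annotator_2 = ["new", "old", "new", "new"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on three of four items (0.75 observed) while chance agreement is 0.5, so raw agreement overstates how well they agree once chance is accounted for.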

Anthology ID: L16-1323
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 2039–2046
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/L16-1323/
Cite (ACL): Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2016. Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2039–2046, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference. (Chamberlain et al., LREC 2016)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/L16-1323.pdf
