Centroids: Gold standards with distributional variation

Ian Lewin, Şenay Kafkas, Dietrich Rebholz-Schuhmann


Abstract
Motivation: Gold Standards for named entities are, ironically, not standard themselves. Some specify the “one perfect annotation”. Others specify “perfectly good alternatives”. The concept of a Silver Standard is relatively new. The objective is consensus rather than perfection. How should the two concepts be best represented and related? Approach: We examine several Biomedical Gold Standards and motivate a new representational format, centroids, which simply and effectively represents name distributions. We define an algorithm for finding centroids, given a set of alternative input annotations, and we test the outputs quantitatively and qualitatively. We also define a metric of relative acceptability on top of the centroid standard. Results: Precision, recall and F-scores of over 0.99 are achieved for the simple sanity check of giving the algorithm Gold Standard inputs. Qualitative analysis of the differences very often reveals errors and incompleteness in the original Gold Standard. Given automatically generated annotations, the centroids effectively represent the range of those contributions and the quality of the centroid annotations is highly competitive with the best of the contributors. Conclusion: Centroids cleanly represent alternative name variations for Silver and Gold Standards. A centroid Silver Standard is derived just like a Gold Standard, only from imperfect inputs.
Anthology ID:
L12-1364
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3894–3900
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/633_Paper.pdf
Cite (ACL):
Ian Lewin, Şenay Kafkas, and Dietrich Rebholz-Schuhmann. 2012. Centroids: Gold standards with distributional variation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3894–3900, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Centroids: Gold standards with distributional variation (Lewin et al., LREC 2012)