Instructions
This is the Human
Evaluation Datasheet (HEDS) form. Each section contains questions about
the human evaluation experiment whose details are being recorded. A
section can contain multiple subsections, each of which can be expanded
or collapsed.
This form is not
submitted to any server when it is completed. Instead, please use the
"download json" button in the "Download to file" section, which
downloads a file (in .json format) containing the current values of
every form field. You can also upload a json file (see the "Upload from
file" section on the left of the screen). Warning: this will delete
your current form content and then populate the blank form with the
content of the file. It is advisable to download files as a backup while
you are completing the form. The form saves field values in the local
storage of your browser; they will be lost if you clear the local
storage, or if you fill in the form in a private/incognito window and then close it.
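Because the save file is plain JSON, you can also inspect a downloaded backup offline before re-uploading it. Below is a minimal sketch of such a check, assuming (purely for illustration; the actual file layout is not documented here) that the file is a flat JSON object mapping form-field identifiers to their values. The filename heds.json and the helper empty_fields are hypothetical names, not part of HEDS itself.

    # Quick offline completeness check for a downloaded HEDS save file.
    # Assumption: the file is a flat JSON object mapping field
    # identifiers to values; adjust the traversal if the real layout differs.
    import json
    import sys

    def empty_fields(path):
        """Return the names of fields whose saved value is empty or missing."""
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        return [name for name, value in data.items()
                if value is None or str(value).strip() == ""]

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "heds.json"
        missing = empty_fields(path)
        print(f"{len(missing)} field(s) still empty")
        for name in missing:
            print(" -", name)

Fields that are genuinely not applicable can still be filled in with N/A on the form itself, as described below.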
The form will not prevent
you from downloading your save file, even when there are error or
warning messages. Yellow warning messages indicate fields that have not
been completed. If a field is not relevant for your experiment, enter
N/A, and ideally also explain why. Red messages are errors: for example,
if the form expects an integer and you enter something else, a red
message is shown. Neither warnings nor errors will prevent you from
saving the form.
You can generate a list
of all current errors/warnings, along with their section numbers, in the
"all form errors" tab at the bottom of the form. A count of errors
will also be refreshed every 60 seconds on the panel on the left side of
the screen.
Section 4 should be
completed for each criterion that is evaluated in the experiment.
Instructions on how to do this are shown at the start of the
section.
Credits
Questions 2.1–2.5, relating
to the evaluated system, and Questions 4.3.1–4.3.8, relating to response
elicitation, are based on Howcroft et al. (2020), with some significant
changes. Questions 4.1.1–4.2.3, relating to quality criteria, and some
of the questions about system outputs, evaluators, and experimental
design (3.1.1–3.2.3, 4.3.5, 4.3.6, 4.3.9–4.3.11) are based on Belz et
al. (2020). HEDS was also informed by van der Lee et al. (2019, 2021)
and by Gehrmann et al.’s (2021) data card guide.
More generally, the original
inspiration for creating a ‘datasheet’ for describing human evaluation
experiments of course comes from seminal papers by Bender & Friedman
(2018), Mitchell et al. (2019) and Gebru et al. (2020).
References
Banarescu, L., Bonial, C.,
Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn,
P., Palmer, M., & Schneider, N. (2013). Abstract Meaning
Representation for sembanking. Proceedings of the 7th Linguistic
Annotation Workshop and Interoperability with Discourse, 178–186.
https://www.aclweb.org/anthology/W13-2322
Belz, A., Mille, S., &
Howcroft, D. M. (2020). Disentangling the properties of human evaluation
methods: A classification system to support comparability,
meta-evaluation and reproducibility testing. Proceedings of the 13th
International Conference on Natural Language Generation, 183–194.
Bender, E. M., &
Friedman, B. (2018). Data statements for natural language processing:
Toward mitigating system bias and enabling better science. Transactions
of the Association for Computational Linguistics, 6, 587–604.
https://doi.org/10.1162/tacl_a_00041
Card, D., Henderson, P.,
Khandelwal, U., Jia, R., Mahowald, K., & Jurafsky, D. (2020). With
little power comes great responsibility. Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP),
9263–9274.
https://doi.org/10.18653/v1/2020.emnlp-main.745
Gebru, T., Morgenstern, J.,
Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford,
K. (2020). Datasheets for datasets.
http://arxiv.org/abs/1803.09010
Gehrmann, S., Adewumi, T.,
Aggarwal, K., Ammanamanchi, P. S., Anuoluwapo, A., Bosselut, A., Chandu,
K. R., Clinciu, M., Das, D., Dhole, K. D., Du, W., Durmus, E., Dušek,
O., Emezue, C., Gangal, V., Garbacea, C., Hashimoto, T., Hou, Y.,
Jernite, Y., … Zhou, J. (2021). The GEM benchmark: Natural language
generation, its evaluation and metrics.
http://arxiv.org/abs/2102.01672
Howcroft, D. M., Belz, A.,
Clinciu, M.-A., Gkatzia, D., Hasan, S. A., Mahamood, S., Mille, S.,
van Miltenburg, E., Santhanam, S., & Rieser, V. (2020). Twenty years
of confusion in human evaluation: NLG needs evaluation sheets and
standardised definitions. Proceedings of the 13th International
Conference on Natural Language Generation, 169–182.
https://www.aclweb.org/anthology/2020.inlg-1.23
Howcroft, D. M., &
Rieser, V. (2021). What happens if you treat ordinal ratings as interval
data? Human evaluations in NLP are even more under-powered than you
think. Proceedings of the 2021 Conference on Empirical Methods in
Natural Language Processing, 8932–8939.
https://doi.org/10.18653/v1/2021.emnlp-main.703
Kamp, H., & Reyle, U.
(2013). From discourse to logic: Introduction to modeltheoretic
semantics of natural language, formal logic and discourse representation
theory (Vol. 42). Springer Science & Business Media.
Mitchell, M., Wu, S.,
Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E.,
Raji, I. D., & Gebru, T. (2019). Model cards for model reporting.
Proceedings of the Conference on Fairness, Accountability, and
Transparency, 220–229.
https://doi.org/10.1145/3287560.3287596
Shimorina, A., & Belz,
A. (2022). The human evaluation datasheet: A template for recording
details of human evaluation experiments in NLP. Proceedings of the 2nd
Workshop on Human Evaluation of NLP Systems (HumEval), 54–75.
https://aclanthology.org/2022.humeval-1.6
van der Lee, C., Gatt, A.,
van Miltenburg, E., Wubben, S., & Krahmer, E. (2019). Best practices
for the human evaluation of automatically generated text. Proceedings
of the 12th International Conference on Natural Language Generation,
355–368.
https://www.aclweb.org/anthology/W19-8643.pdf
van der Lee, C., Gatt, A.,
van Miltenburg, E., & Krahmer, E. (2021). Human evaluation of
automatically generated text: Current trends and best practice
guidelines. Computer Speech & Language, 67, 101151.
https://doi.org/10.1016/j.csl.2020.101151