HEDS Form

Download to file
download json

Press the button to download your current form in JSON format.
Upload from file


upload json

Press the button to upload a JSON file. Warning: This will clear your current form completely, then populate it with the contents of the file.
Count of errors
Updates every 60 seconds.

Instructions

This is the Human Evaluation Datasheet (HEDS) form. Each section contains questions about the human evaluation experiment for which details are being recorded. A section may contain multiple subsections, each of which can be expanded or collapsed.

This form is not submitted to any server when it is completed; instead, please use the "download json" button in the "Download to file" section. This will download a file (in .json format) containing the current values of each form field. You can also upload a JSON file (see the "Upload from file" section on the left of the screen). Warning: This will delete your current form content, then populate the blank form with the contents of the file. It is advisable to download files as a backup while you are completing the form. The form saves field values in your browser's local storage; they will be deleted if you clear the local storage, or if you are in a private/incognito window and then close it.
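As a rough illustration of how this kind of client-side saving and JSON export/import can work, the sketch below is a minimal example only, not the actual HEDS implementation; the storage key "heds-form" and the function names saveLocally, downloadJson and uploadJson are hypothetical.

```ts
// Minimal sketch of browser-side persistence for a form like this one.
type FormState = Record<string, string>; // question ID -> current field value

function saveLocally(state: FormState): void {
  // Survives reloads, but is lost if local storage is cleared or a
  // private/incognito window is closed.
  localStorage.setItem("heds-form", JSON.stringify(state));
}

function downloadJson(state: FormState, filename = "heds.json"): void {
  // Offer the current field values as a .json file download.
  const blob = new Blob([JSON.stringify(state, null, 2)], { type: "application/json" });
  const link = document.createElement("a");
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  link.click();
  URL.revokeObjectURL(link.href);
}

async function uploadJson(file: File): Promise<FormState> {
  // Replaces the current form content entirely, as the warning above notes.
  return JSON.parse(await file.text()) as FormState;
}
```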

The form will not prevent you from downloading your save file, even when there are error or warning messages. Yellow warning messages indicate fields that have not been completed. If a field is not relevant to your experiment, enter 'N/A', and ideally also explain why. Red messages are errors: for example, if the form expects an integer and you have entered something else, a red message will be shown. Errors will also not prevent you from saving the form.

You can generate a list of all current errors and warnings, along with their section numbers, in the "all form errors" tab at the bottom of the form. A count of errors is also refreshed every 60 seconds in the panel on the left side of the screen.

Section 4 should be completed for each criterion that is evaluated in the experiment. Instructions on how to do this are shown at the start of that section.


Credits
Questions 2.1–2.5, relating to the evaluated system(s), and 4.3.1–4.3.8, relating to response elicitation, are based on Howcroft et al. (2020), with some significant changes. Questions 4.1.1–4.2.3, relating to quality criteria, and some of the questions about system outputs, evaluators, and experimental design (3.1.1–3.2.3, 4.3.5, 4.3.6, 4.3.9–4.3.11) are based on Belz et al. (2020). HEDS was also informed by van der Lee et al. (2019, 2021) and by the data card guide of Gehrmann et al. (2021). More generally, the original inspiration for creating a ‘datasheet’ for describing human evaluation experiments of course comes from seminal papers by Bender & Friedman (2018), Mitchell et al. (2019) and Gebru et al. (2020).
References
Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., & Schneider, N. (2013). Abstract Meaning Representation for sembanking. Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, 178–186. https://www.aclweb.org/anthology/W13-2322

Belz, A., Mille, S., & Howcroft, D. M. (2020). Disentangling the properties of human evaluation methods: A classification system to support comparability, meta-evaluation and reproducibility testing. Proceedings of the 13th International Conference on Natural Language Generation, 183–194.

Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587–604. https://doi.org/10.1162/tacl_a_00041

Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., & Jurafsky, D. (2020). With little power comes great responsibility. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9263–9274. https://doi.org/10.18653/v1/2020.emnlp-main.745

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2020). Datasheets for datasets. http://arxiv.org/abs/1803.09010

Gehrmann, S., Adewumi, T., Aggarwal, K., Ammanamanchi, P. S., Anuoluwapo, A., Bosselut, A., Chandu, K. R., Clinciu, M., Das, D., Dhole, K. D., Du, W., Durmus, E., Dušek, O., Emezue, C., Gangal, V., Garbacea, C., Hashimoto, T., Hou, Y., Jernite, Y., … Zhou, J. (2021). The GEM benchmark: Natural language generation, its evaluation and metrics. http://arxiv.org/abs/2102.01672

Howcroft, D. M., Belz, A., Clinciu, M.-A., Gkatzia, D., Hasan, S. A., Mahamood, S., Mille, S., Miltenburg, E. van, Santhanam, S., & Rieser, V. (2020). Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. Proceedings of the 13th International Conference on Natural Language Generation, 169–182. https://www.aclweb.org/anthology/2020.inlg-1.23

Howcroft, D. M., & Rieser, V. (2021). What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 8932–8939. https://doi.org/10.18653/v1/2021.emnlp-main.703

Kamp, H., & Reyle, U. (2013). From discourse to logic: Introduction to model-theoretic semantics of natural language, formal logic and discourse representation theory (Vol. 42). Springer Science & Business Media.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. https://doi.org/10.1145/3287560.3287596

Shimorina, A., & Belz, A. (2022). The human evaluation datasheet: A template for recording details of human evaluation experiments in NLP. Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), 54–75. https://aclanthology.org/2022.humeval-1.6

van der Lee, C., Gatt, A., Miltenburg, E. van, Wubben, S., & Krahmer, E. (2019). Best practices for the human evaluation of automatically generated text. Proceedings of the 12th International Conference on Natural Language Generation, 355–368. https://www.aclweb.org/anthology/W19-8643.pdf

van der Lee, C., Gatt, A., van Miltenburg, E., & Krahmer, E. (2021). Human evaluation of automatically generated text: Current trends and best practice guidelines. Computer Speech & Language, 67, 101151. https://doi.org/10.1016/j.csl.2020.101151

Sections 1.1–1.3 record bibliographic and related information. These are straightforward and don’t warrant much in-depth explanation.



Enter a link to an online copy of the main reference (e.g., a paper) for the human evaluation experiment. If the experiment hasn’t been run yet, and the form is being completed for the purpose of submitting it for preregistration, simply enter ‘for preregistration’.


Enter details of the experiment within the paper for which this sheet is being completed. For example, the title of the experiment and/or a section number. If there is only one human evaluation, still enter this information. If this form is being completed for pre-registration, enter a note that differentiates this experiment from any others that you are carrying out as part of the same overall work.

1.1.2:  Please complete this question.


Enter the link(s). Such resources include system outputs, evaluation tools, etc. If there aren’t any publicly shared resources (yet), enter ‘N/A’.


This section records the name, affiliation, and email address of the person completing this sheet, and of the contact author if different.



Enter the name of the person completing this sheet.

1.3.1.1:  Please complete this question.

Enter the affiliation of the person completing this sheet.

1.3.1.2:  Please complete this question.

Enter the email address of the person completing this sheet.

1.3.1.3:  Please complete this question.


Enter the name of the contact author; enter ‘N/A’ if it is the same person as in Question 1.3.1.1.

1.3.2.1:  Please complete this question.

Enter the affiliation of the contact author; enter ‘N/A’ if it is the same as in Question 1.3.1.2.

1.3.2.2:  Please complete this question.

Enter the email address of the contact author; enter ‘N/A’ if it is the same as in Question 1.3.1.3.

1.3.2.3:  Please complete this question.

Questions 2.1–2.5 record information about the system(s) (or human-authored stand-ins) whose outputs are evaluated in the evaluation experiment that this sheet is being completed for. The input, output, and task questions in this section are closely interrelated: the value for one partially determines the others, as indicated for some combinations in Question 2.3.


Question 2.1:  What type of input do the evaluated system(s) take?

This question is about the type(s) of input, where input refers to the representations and/or data structures shared by all evaluated systems. This question is about input type, regardless of number. E.g. if the input is a set of documents, you would still select text: document below.

Select all that apply. If none match, select ‘other’ and describe.

Please provide further details for your above selection(s)
2.1:  Please select at least 1 of the above options.

Question 2.2:  What type of output do the evaluated system(s) generate?

This question is about the type(s) of output, where output refers to the representations and/or data structures produced by all evaluated systems. This question is about output type, regardless of number. E.g. if the output is a set of documents, you would still select text: document below. Note that the options for outputs are the same as for inputs, except that the no input (human generation) option is replaced with human-generated ‘outputs’, and the control feature option is removed.

Select all that apply. If none match, select ‘other’ and describe.

Please provide further details for your above selection(s)
2.2:  Please select at least 1 of the above options.

Question 2.3:  How would you describe the task that the evaluated system(s) perform in mapping the inputs in Q2.1 to the outputs in Q2.2?

This question is about the task(s) performed by the system(s) being evaluated. This is independent of the application domain (financial reporting, weather forecasting, etc.), or the specific method (rule-based, neural, etc.) implemented in the system. We indicate mutual constraints between inputs, outputs and task for some of the options below.

Occasionally, more than one of the options below may apply. Select all that apply. If none match, select ‘other’ and describe.

Please provide further details for your above selection(s)
2.3:  Please select at least 1 of the above options.

Question 2.4:  What are the input languages that are used by the system?

This question is about the language(s) of the inputs accepted by the system(s) being evaluated. Select any language name(s) that apply, mapped to standardised full language names in ISO 639-1 (2019). E.g. English, Herero, Hindi. If no language is accepted as (part of) the input, select ‘N/A’.

Select all that apply. If any languages you are using are not covered by this list, select ‘other’ and describe.

Please provide further details for your above selection(s)
2.4:  Please select at least 1 of the above options.

Question 2.5:  What are the output languages that are used by the system?

This question is about the language(s) of the outputs generated by the system(s) being evaluated. Select any language name(s) that apply, mapped to standardised full language names in ISO 639-1 (2019). E.g. English, Herero, Hindi. If no language is generated, select ‘N/A’.

Select all that apply. If any languages you are using are not covered by this list, select ‘other’ and describe.

Please provide further details for your above selection(s)
2.5:  Please select at least 1 of the above options.


Questions 3.1.1–3.1.3 record information about the size of the sample of outputs (or human-authored stand-ins) evaluated per system, how the sample was selected, and what its statistical power is.


Enter the number of system outputs (or other evaluation items) that are evaluated per system by at least one evaluator in the experiment. For most experiments this should be an integer; if the number of outputs varies, please provide further details here.

3.1.1:  Please complete this question.

Question 3.1.2:  How are system outputs (or other evaluation items) selected for inclusion in the evaluation experiment?

Select one option. If none match, select ‘other’ and describe:

Please provide further details for your above selection(s)
3.1.2:  Please select at least 1 of the above options.


Enter the name of the method used.

3.1.3.1:  Please complete this question.

Enter the numerical results of a statistical power calculation on the output sample.

3.1.3.2:  Please complete this question.

Enter a link to the script used (or another way of identifying the script). See, e.g., Card et al. (2020), Howcroft & Rieser (2021).

3.1.3.3:  Please complete this question.
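As an illustration of the kind of calculation Question 3.1.3.2 asks about, the sketch below estimates the number of outputs needed per system for a two-sample comparison of mean ratings, using a standard normal-approximation formula. It is a sketch only, under assumptions the HEDS form does not prescribe (a two-sided test at alpha = 0.05 and 80% power); the function name samplesPerSystem is hypothetical.

```ts
// Approximate per-system sample size for detecting a difference in mean
// ratings between two systems, via the normal approximation:
//   n per system ≈ 2 * ((z_alpha + z_beta) / d)^2, where d is Cohen's d.
function samplesPerSystem(effectSizeD: number, zAlpha = 1.96, zBeta = 0.84): number {
  // zAlpha = 1.96 -> two-sided test at alpha = 0.05; zBeta = 0.84 -> 80% power.
  return Math.ceil(2 * ((zAlpha + zBeta) / effectSizeD) ** 2);
}

// Detecting a medium effect (d = 0.5) needs roughly 63 outputs per system.
console.log(samplesPerSystem(0.5)); // 63
```

Dedicated power-analysis tools, such as those discussed by Card et al. (2020), account for the specific test actually used and should be preferred in practice.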

Questions 3.2.1–3.2.5 record information about the evaluators participating in the experiment.


Enter the total number of evaluators participating in the experiment, as an integer.

3.2.1:  Please complete this question.

Questions 3.2.2.1–3.2.2.5 record information about the type of evaluators participating in the experiment.


Question 3.2.2.1:  What kind of evaluators are in this experiment?

Select one option. These options should be valid for most experiments, but if not, select ‘N/A’ and describe why:

Please provide further details for your above selection(s)
3.2.2.1:  Please select at least 1 of the above options.

Question 3.2.2.2:  Were the participants paid or unpaid?

Select one option. These options should be valid for most experiments, but if not, select ‘N/A’ and describe why:

Please provide further details for your above selection(s)
3.2.2.2:  Please select at least 1 of the above options.

Question 3.2.2.3:  Were the participants previously known to the authors?

Select one option. These options should be valid for most experiments, but if not, select ‘N/A’ and describe why:

Please provide further details for your above selection(s)
3.2.2.3:  Please select at least 1 of the above options.

Question 3.2.2.4:  Were one or more of the authors among the participants?

Select one option. These options should be valid for most experiments, but if not, select ‘N/A’ and describe why:

Please provide further details for your above selection(s)
3.2.2.4:  Please select at least 1 of the above options.

Please use this field to elaborate on your selections for questions 3.2.2.1 to 3.2.2.4 above.

3.2.2.5:  Please complete this question.

Please explain how your evaluators are recruited. Do you send emails to a given list? Do you post invitations on social media? Posters on university walls? Were there any gatekeepers involved? What are the exclusion/inclusion criteria?

3.2.3:  Please complete this question.

Use this space to describe any training evaluators were given as part of the experiment to prepare them for the evaluation task, including any practice evaluations they did. This includes any introductory explanations they’re given, e.g. on the start page of an online evaluation tool.

3.2.4:  Please complete this question.

These characteristics may be known either because they were qualifying criteria, or from information gathered as part of the evaluation.

Use this space to list any characteristics not covered in previous questions that the evaluators are known to have, either because evaluators were selected on the basis of a characteristic, or because information about a characteristic was collected as part of the evaluation. This might include geographic location of IP address, educational level, or demographic information such as gender, age, etc. Where characteristics differ among evaluators (e.g. gender, age, location etc.), also give numbers for each subgroup.

3.2.5:  Please complete this question.

Sections 3.3.1–3.3.8 record information about the experimental design of the evaluation experiment.


Question 3.3.1:  Has the experimental design been preregistered? If yes, on which registry?

Select ‘Yes’ or ‘No’; if ‘Yes’ also give the name of the registry and a link to the registration page for the experiment.


Please provide further details for your above selection(s)
3.3.1:  Please select at least 1 of the above options.

Describe here the method used to collect responses, e.g. paper forms, Google forms, SurveyMonkey, Mechanical Turk, CrowdFlower, audio/video recording, etc.

3.3.2:  Please complete this question.

Questions 3.3.3.1 and 3.3.3.2 record information about quality assurance.


Question 3.3.3.1:  What quality assurance methods are used to ensure evaluators and/or their responses are suitable?

If any methods other than those listed were used, select ‘other’, and describe why below. If no methods were used, select none of the above and enter ‘No Method’

Select all that apply:

Please provide further details for your above selection(s)
3.3.3.1:  Please select at least 1 of the above options.

If no methods were used, enter ‘N/A’

3.3.3.2:  Please complete this question.

Questions 3.3.4.1 and 3.3.4.2 record information about the form or user interface that was shown to participants.


Please record a link to a screenshot or copy of the form if possible. If there are many files, please create a signpost page (e.g., on GitHub) that contains links to all applicable resources. If there is a separate introductory interface/page, include it under Question 3.2.4.


Describe what evaluators are shown, in addition to providing the links in 3.3.4.1.

3.3.4.2:  Please complete this question.

Question 3.3.5:  How free are evaluators regarding when and how quickly to carry out evaluations?

Select all that apply:

Please provide further details for your above selection(s)
3.3.5:  Please select at least 1 of the above options.

Question 3.3.6:  Are evaluators told they can ask questions about the evaluation and/or provide feedback?

Select all that apply.

Please provide further details for your above selection(s)
3.3.6:  Please select at least 1 of the above options.

Question 3.3.7:  What are the experimental conditions in which evaluators carry out the evaluations?

Multiple-choice options (select one). If none match, select ‘other’ and describe.

Please provide further details for your above selection(s)
3.3.7:  Please select at least 1 of the above options.

Use this space to describe the variations in the conditions in which evaluators carry out the evaluation, both in situations where those variations are controlled and in situations where they are not. If the evaluation is carried out at a place of the evaluators’ own choosing, enter ‘N/A’.

3.3.8:  Please complete this question.

Questions in this section collect information about each quality criterion assessed in the single human evaluation experiment that this sheet is being completed for.


In this section you can create named subsections for each criterion that is being evaluated; the form is then duplicated for each criterion. To create a criterion, type its name in the field and press the New button; it will then appear in a tab that allows you to toggle the active criterion. To delete the current criterion, press the Delete current button.



Questions 4.1.1–4.1.3 capture the aspect of quality that is assessed by a given quality criterion in terms of three orthogonal properties. They help determine whether or not the same aspect of quality is being evaluated in different evaluation experiments. The three properties characterise quality criteria in terms of (i) what type of quality is being assessed; (ii) what aspect of the system output is being assessed; and (iii) whether system outputs are assessed in their own right or with reference to some system-internal or system-external frame of reference. For full explanations see Belz et al. (2020).


Question 4.1.1:  What type of quality is assessed by the quality criterion?


Please provide further details for your above selection(s)

Question 4.1.2:  Which aspect of system outputs is assessed by the quality criterion?


Please provide further details for your above selection(s)

Question 4.1.3:  Is each output assessed for quality in its own right, or with reference to a system-internal or external frame of reference?


Please provide further details for your above selection(s)

Questions 4.2.1–4.2.3 record properties that are orthogonal to quality criteria (covered by questions in the preceding section), i.e. any given quality criterion can in principle be combined with any of the modes (although some combinations are more common than others).


Question 4.2.1:  Does an individual assessment involve an objective or a subjective judgment?


Please provide further details for your above selection(s)

Question 4.2.2:  Are outputs assessed in absolute or relative terms?


Please provide further details for your above selection(s)

Question 4.2.3:  Is the evaluation intrinsic or extrinsic?


Please provide further details for your above selection(s)

The questions in this section concern response elicitation, by which we mean how the ratings or other measurements that represent assessments for the quality criterion in question are obtained, covering what is presented to evaluators, how they select responses, via what type of tool, etc. The eleven questions (4.3.1–4.3.11) are based on the information annotated in the large-scale survey of human evaluation methods in NLG by Howcroft et al. (2020).


Enter the name you use to refer to the quality criterion in explanations and/or interfaces created for evaluators. Examples of quality criterion names include Fluency, Clarity, and Meaning Preservation. If no name is used, state ‘N/A’.


Copy and paste the verbatim definition you give to evaluators to explain the quality criterion they’re assessing. If you don’t explicitly call it a definition, enter the nearest thing to a definition you give them. If you don’t give any definition, state ‘N/A’.


Question 4.3.3:  Are the rating instrument response values discrete or continuous? If discrete, please also indicate the size.

Is the rating instrument discrete or continuous? When discrete, also record the number of different response values for this quality criterion. E.g. for a 5-point Likert scale, select Discrete and record the size as 5 in the box below. For two-way forced-choice preference judgments, the size would be 2; if there’s also a no-preference option, enter 3. For a slider that is mapped to 100 different values for the purpose of recording assessments, select Discrete and record the size as 100. If no rating instrument is used (e.g. when the evaluation gathers post-edits or qualitative feedback only), select N/A.


Please provide further details for your above selection(s)

List, or give the range of, the possible values of the rating instrument. The list or range should be of the size specified in Question 4.3.3. If there are too many to list, use a range. E.g. for two-way forced-choice preference judgments, the list entered might be A better, B better; if there’s also a no-preference option, the list might be A better, B better, neither. For a slider that is mapped to 100 different values for the purpose of recording assessments, the range 1–100 might be entered. If no rating instrument is used (e.g. when the evaluation gathers post-edits or qualitative feedback only), enter ‘N/A’.


Question 4.3.5:  How is the scale or other rating instrument presented to evaluators? If none match, select ‘Other’ and describe.


Please provide further details for your above selection(s)

If (and only if) there is no rating instrument, i.e. you entered ‘N/A’ for Questions 4.3.3–4.3.5, describe here the task evaluators perform. If there is a rating instrument, enter ‘N/A’.


Copy and paste the verbatim text that evaluators see during each assessment and that is intended to convey the evaluation task to them. E.g. Which of these texts do you prefer? Or Make any corrections to this text that you think are necessary in order to improve it to the point where you would be happy to provide it to a client.


Question 4.3.8:  Form of response elicitation. If none match, select ‘Other’ and describe.

Explanations adapted from Howcroft et al. (2020).


Please provide further details for your above selection(s)

Normally a set of separate assessments is collected from evaluators and then converted to the results as reported. Describe here the method(s) used in the conversion(s), e.g. macro-averages or micro-averages computed from numerical scores to provide summary, per-system results. If no such method was used, enter ‘N/A’.
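As a concrete, purely illustrative example of the macro- vs micro-average distinction mentioned above, the sketch below computes both for one system from a flat list of individual ratings; the Assessment type and its field names are invented for this example and are not part of HEDS.

```ts
// Hypothetical illustration of micro- vs macro-averaging per-system ratings.
interface Assessment {
  system: string; // which system produced the evaluated output
  item: string;   // ID of the evaluated output
  score: number;  // one evaluator's numerical rating
}

// Micro-average: mean over all individual ratings collected for the system.
function microAverage(assessments: Assessment[], system: string): number {
  const scores = assessments.filter(a => a.system === system).map(a => a.score);
  return scores.reduce((sum, x) => sum + x, 0) / scores.length;
}

// Macro-average: average the ratings per item first, then average the item
// means, so items rated by more evaluators do not dominate the result.
function macroAverage(assessments: Assessment[], system: string): number {
  const byItem = new Map<string, number[]>();
  for (const a of assessments) {
    if (a.system !== system) continue;
    const list = byItem.get(a.item) ?? [];
    list.push(a.score);
    byItem.set(a.item, list);
  }
  const itemMeans = [...byItem.values()].map(v => v.reduce((s, x) => s + x, 0) / v.length);
  return itemMeans.reduce((sum, x) => sum + x, 0) / itemMeans.length;
}
```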


Enter a list of the methods used for calculating the effect size and significance of any results for this quality criterion, as reported in the paper given in Question 1.1. If none were calculated, state ‘None’.


Questions 4.3.11.1 and 4.3.11.2 record information about inter-annotator agreement.


Question 4.3.11.1:  Has the inter-annotator agreement between evaluators for this quality criterion been measured? If yes, what method was used?

Select one option. If Yes, enter the methods used to compute any measures of inter-annotator agreement obtained for the quality criterion. If N/A, explain why.


Please provide further details for your above selection(s)

Enter N/A if there was none.


Questions 4.3.12.1 and 4.3.12.2 record information about intra-annotator agreement.


Question 4.3.12.1:  Has the intra-annotator agreement between evaluators for this quality criterion been measured? If yes, what method was used?

Select one option. If Yes, enter the methods used to compute any measures of intra-annotator agreement obtained for the quality criterion. If N/A, explain why.


Please provide further details for your above selection(s)

Enter N/A if there was none.


The questions in this section relate to ethical aspects of the evaluation. Information can be entered in the text box provided, and/or by linking to a source where complete information can be found.


Typically, research organisations, universities and other higher-education institutions require some form of ethical approval before experiments involving human participants, however innocuous, are permitted to proceed. Please provide here the name of the body that approved the experiment, or state ‘No’ if approval has not (yet) been obtained.

5.1:  Please complete this question.

State ‘No’ if no personal data as defined by GDPR was recorded or collected, otherwise explain how conformity with GDPR requirements such as privacy and security was ensured, e.g. by linking to the (successful) application for ethics approval from Question 5.1.

5.2:  Please complete this question.

State ‘No’ if no special-category data as defined by GDPR was recorded or collected, otherwise explain how conformity with GDPR requirements relating to special-category data was ensured, e.g. by linking to the (successful) application for ethics approval from Question 5.1.

5.3:  Please complete this question.

Use this box to describe any ex ante or ex post impact assessments that have been carried out in relation to the evaluation experiment, such that the assessment plan and process, as well as the outcomes, were captured in written form. Link to documents if possible. Types of impact assessment include data protection impact assessments, e.g. under GDPR. Environmental and social impact assessment frameworks are also available.

5.4:  Please complete this question.

List of all errors
refresh list of all errors

Press the button to refresh the list of all errors.