@inproceedings{castilho-2020-page,
    title = "On the Same Page? Comparing Inter-Annotator Agreement in Sentence and Document Level Human Machine Translation Evaluation",
    author = "Castilho, Sheila",
    editor = {Barrault, Lo{\"i}c  and
      Bojar, Ond{\v{r}}ej  and
      Bougares, Fethi  and
      Chatterjee, Rajen  and
      Costa-juss{\`a}, Marta R.  and
      Federmann, Christian  and
      Fishel, Mark  and
      Fraser, Alexander  and
      Graham, Yvette  and
      Guzman, Paco  and
      Haddow, Barry  and
      Huck, Matthias  and
      Yepes, Antonio Jimeno  and
      Koehn, Philipp  and
      Martins, Andr{\'e}  and
      Morishita, Makoto  and
      Monz, Christof  and
      Nagata, Masaaki  and
      Nakazawa, Toshiaki  and
      Negri, Matteo},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.137/",
    pages = "1150--1159",
    abstract = {Document-level evaluation of machine translation has raised interest in the community, especially since responses to the claims of ``human parity'' (Toral et al., 2018; L{\"a}ubli et al., 2018) with document-level human evaluations have been published. Yet, little is known about best practices regarding human evaluation of machine translation at the document level. This paper presents a comparison of the differences in inter-annotator agreement between quality assessments using sentence- and document-level set-ups. We report results of the agreement between professional translators for fluency and adequacy scales, error annotation, and pair-wise ranking, along with the effort needed to perform the different tasks. To the best of our knowledge, this is the first study of its kind.}
}