ReproHum #0866-04: Variability in Human Judgments of Sociopolitical Acceptability Across Studies

Rui Fan; Guanyi Chen

Abstract

Human evaluations are essential for assessing NLP systems, but their reproducibility can be limited when judgments involve socially sensitive constructs. This paper reproduces the perceived sociopolitical acceptability evaluation in (CITATION), where annotators judged whether model-generated writer-intent implications reflected mainstream or fringe viewpoints. Using the same 600 headline–belief pairs, we collected new annotations on Prolific and compared our results with both the original study and a prior reproduction. Our scores are lower than the original results. Under a 70% threshold, these findings do not support the original conclusion that most generations were socially acceptable. Overall, our results align more closely with the prior reproduction, while also showing substantial variability, especially for GPT2-large. We argue that this variability may arise from a combination of platform differences, task framing, topic effects, and changes in social context over time. These findings highlight the importance of reporting not only annotation results, but also the evaluation setting in which subjective social judgments are collected.

Anthology ID: 2026.gem-main.87
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 1104–1110
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.87/
Cite (ACL): Rui Fan and Guanyi Chen. 2026. ReproHum #0866-04: Variability in Human Judgments of Sociopolitical Acceptability Across Studies. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1104–1110, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): ReproHum #0866-04: Variability in Human Judgments of Sociopolitical Acceptability Across Studies (Fan & Chen, GEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.87.pdf
