Alexander Sahn


2022

pdf
The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism
Pratik Sachdeva | Renata Barreto | Geoff Bacon | Alexander Sahn | Claudia von Vacano | Chris Kennedy
Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022

We introduce the Measuring Hate Speech corpus, a dataset created to measure hate speech while adjusting for annotators’ perspectives. It consists of 50,070 social media comments spanning YouTube, Reddit, and Twitter, labeled by 11,143 annotators recruited from Amazon Mechanical Turk. Each observation includes 10 ordinal labels: sentiment, disrespect, insult, attacking/defending, humiliation, inferior/superior status, dehumanization, violence, genocide, and a 3-valued hate speech benchmark label. The labels are aggregated using faceted Rasch measurement theory (RMT) into a continuous score that measures each comment’s location on a hate speech spectrum. The annotation experimental design assigned comments to multiple annotators in order to yield a linked network, allowing annotator disagreement (perspective) to be statistically summarized. Annotators’ labeling strictness was estimated during the RMT scaling, projecting their perspective onto a linear measure that was adjusted for the hate speech score. Models that incorporate this annotator perspective parameter as an auxiliary input can generate label- and score-level predictions conditional on annotator perspective. The corpus includes the identity group targets of each comment (8 groups, 42 subgroups) and annotator demographics (6 groups, 40 subgroups), facilitating analyses of interactions between annotator- and comment-level identities, i.e. identity-related annotator perspective.