Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification

Sebastian Loftus, Adrian Mülthaler, Sanne Hoeken, Sina Zarrieß, Ozge Alacam


Abstract
Annotator disagreement poses a significant challenge in subjective tasks like hate speech detection. In this paper, we introduce a novel variant of the HateWiC task that explicitly models annotator agreement by estimating the proportion of annotators who classify the meaning of a term as hateful. To tackle this challenge, we explore the use of Llama 3 models fine-tuned through Direct Preference Optimization (DPO). Our experiments show that while LLMs perform well for majority-based hate classification, they struggle with the more complex agreement-aware task. DPO fine-tuning offers improvements, particularly when applied to instruction-tuned models. Still, our results emphasize the need for improved modeling of subjectivity in hate classification, and this study can serve as a foundation for future advancements.
Anthology ID:
2025.woah-1.47
Volume:
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del-Arco, Zeerak Talat, Francielle Vargas
Venues:
WOAH | WS
Publisher:
Association for Computational Linguistics
Pages:
538–547
URL:
https://preview.aclanthology.org/landing_page/2025.woah-1.47/
Cite (ACL):
Sebastian Loftus, Adrian Mülthaler, Sanne Hoeken, Sina Zarrieß, and Ozge Alacam. 2025. Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification. In Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH), pages 538–547, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification (Loftus et al., WOAH 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.woah-1.47.pdf