Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification
Sebastian Loftus, Adrian Mülthaler, Sanne Hoeken, Sina Zarrieß, Ozge Alacam
Abstract
Annotator disagreement poses a significant challenge in subjective tasks like hate speech detection. In this paper, we introduce a novel variant of the HateWiC task that explicitly models annotator agreement by estimating the proportion of annotators who classify the meaning of a term as hateful. To tackle this challenge, we explore the use of Llama 3 models fine-tuned through Direct Preference Optimization (DPO). Our experiments show that while LLMs perform well for majority-based hate classification, they struggle with the more complex agreement-aware task. DPO fine-tuning offers improvements, particularly when applied to instruction-tuned models. Yet, our results emphasize the need for improved modeling of subjectivity in hate classification, and this study can serve as a foundation for future advancements.
- Anthology ID:
- 2025.woah-1.47
- Volume:
- Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)
- Month:
- August
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del-Arco, Zeerak Talat, Francielle Vargas
- Venues:
- WOAH | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 538–547
- URL:
- https://preview.aclanthology.org/landing_page/2025.woah-1.47/
- Cite (ACL):
- Sebastian Loftus, Adrian Mülthaler, Sanne Hoeken, Sina Zarrieß, and Ozge Alacam. 2025. Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification. In Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH), pages 538–547, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification (Loftus et al., WOAH 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.woah-1.47.pdf