Hannes Gröner
2025
Disentangling Subjectivity and Uncertainty for Hate Speech Annotation and Modeling using Gaze
Özge Alacam | Sanne Hoeken | Andreas Säuberli | Hannes Gröner | Diego Frassinelli | Sina Zarrieß | Barbara Plank
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Variation is inherent in opinion-based annotation tasks like sentiment or hate speech analysis. It arises not only from errors, fatigue, or sentence ambiguity but also from genuine differences in opinion shaped by background, experience, and culture. In this paper, we first show how annotators’ confidence ratings can be of great use for disentangling subjective variation from uncertainty, without relying on specific features present in the data (text, gaze, etc.). Our goal is to establish distinct dimensions of variation that are often not clearly separated in existing work on modeling annotator variation. We illustrate our approach through a hate speech detection task, demonstrating that models are affected differently by instances of uncertainty and subjectivity. In addition, we show that human gaze patterns offer valuable indicators of subjective evaluation and uncertainty. Disclaimer: This paper contains sentences that may be offensive.
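The core idea, separating subjectivity from uncertainty via confidence ratings, lends itself to a compact illustration. The following minimal Python sketch assumes the intuition stated in the abstract: disagreement among annotators paired with high self-reported confidence points to genuine subjectivity, while disagreement paired with low confidence points to uncertainty. The function name and thresholds are hypothetical, not taken from the paper.

import numpy as np

def classify_instance(labels, confidences, agree_thresh=0.8, conf_thresh=0.6):
    """Sort an annotated instance into agreement / subjectivity / uncertainty.

    labels:      binary hate-speech labels, one per annotator (0 or 1)
    confidences: self-reported confidence ratings in [0, 1], one per annotator
    Thresholds are illustrative assumptions, not values from the paper.
    """
    labels = np.asarray(labels)
    majority_share = max(labels.mean(), 1 - labels.mean())  # inter-annotator agreement
    mean_conf = float(np.mean(confidences))

    if majority_share >= agree_thresh:
        return "agreement"      # annotators converge on one label
    if mean_conf >= conf_thresh:
        return "subjectivity"   # confident disagreement: genuine differences of opinion
    return "uncertainty"        # unconfident disagreement: ambiguity or noise

# Three annotators split 2-1 but all feel sure of their label -> subjectivity
print(classify_instance(labels=[1, 0, 1], confidences=[0.9, 0.8, 0.9]))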
2024
Evaluating Diversity in Automatic Poetry Generation
Yanran Chen | Hannes Gröner | Sina Zarrieß | Steffen Eger
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Natural Language Generation (NLG), and more generally generative AI, are among the most impactful research fields today. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. While most previous research has focused on forms of the Turing test when evaluating automatic poetry generation (can humans distinguish between automatic and human-generated poetry?), we evaluate the diversity of automatically generated poetry (with a focus on quatrains) by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic, and stylistic dimensions. We assess different model types (word- vs. character-level, general-purpose LLMs vs. poetry-specific models), including the recent LLaMA3-8B, and different types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably underdiverse along multiple dimensions: they often do not rhyme sufficiently, are semantically too uniform, and do not even match the length distribution of human poetry. Our experiments reveal, however, that style-conditioning and character-level modeling clearly increase diversity across virtually all dimensions we explore. The limitations we identify may serve as the basis for more genuinely diverse future poetry generation models.
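One of the reported gaps, the mismatch between generated and human length distributions, is easy to illustrate. The short Python sketch below compares token-count distributions with a two-sample Kolmogorov-Smirnov test; this particular statistic is an assumption chosen for illustration and not necessarily the measure used in the paper, and the corpora are toy stand-ins.

from scipy.stats import ks_2samp

def lengths(poems):
    """Token count per poem (simple whitespace tokenization)."""
    return [len(p.split()) for p in poems]

# Toy stand-ins; in practice each list would hold many quatrains.
human = [
    "roses are red violets are blue sugar is sweet and so are you",
    "the fog comes on little cat feet it sits looking over harbor and city",
]
generated = ["a short verse", "another brief line", "one more"]

stat, p_value = ks_2samp(lengths(human), lengths(generated))
print(f"KS statistic = {stat:.3f}")  # larger values = distributions differ more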