Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
Atharvan Dogra, Soumya Suvra Ghosal, Ameet Deshpande, Ashwin Kalyan, Dinesh Manocha
Abstract
Large language models are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Using humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. We further supplement this by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that even for fixed neutral setups, harmful outputs receive higher humor scores, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting intrinsic structural embedding in learned humor distributions. Experiments and human evaluation on an additional satire-generation task with human-perceived funniness judgments show that LLM funniness relies on increased stereotypicality and toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain 10%–21% in mean humor score, stereotypical jokes appear 11% to 28% more often among the jokes marked funny by an LLM-based metric, and up to 10% more often in generations perceived as funny by humans.- Anthology ID:
- 2026.eacl-long.373
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- EACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 7971–7990
- Language:
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.373/
- DOI:
- Cite (ACL):
- Atharvan Dogra, Soumya Suvra Ghosal, Ameet Deshpande, Ashwin Kalyan, and Dinesh Manocha. 2026. Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7971–7990, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models (Dogra et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.373.pdf