Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech?

Jingjie Zeng, Liang Yang, Zekun Wang, Yuanyuan Sun, Hongfei Lin


Abstract
Implicit hate speech has become a significant challenge for online platforms: because its hateful intent is expressed indirectly, it often evades detection by large language models (LLMs). This study identifies the limitations of LLMs in detecting implicit hate speech, particularly when it is disguised as a seemingly harmless expression by a rhetorical device. To address this challenge, we employ a jailbreaking strategy and energy-based constrained decoding techniques, and design a small model for measuring the energy of metaphorical rhetoric. This approach can lead LLMs to generate metaphorical implicit hate speech. Our research reveals that advanced LLMs, such as GPT-4o, frequently misinterpret metaphorical implicit hate speech and fail to prevent its propagation effectively. Even specialized models, such as ShieldGemma and LlamaGuard, prove inadequate at blocking such content, often misclassifying it as harmless speech. This work highlights the vulnerability of current LLMs to implicit hate speech and emphasizes the need for improvements to better address hate speech threats.
Anthology ID:
2025.acl-long.814
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
16657–16677
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.814/
Cite (ACL):
Jingjie Zeng, Liang Yang, Zekun Wang, Yuanyuan Sun, and Hongfei Lin. 2025. Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16657–16677, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech? (Zeng et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.814.pdf