Abstract
To avoid detection by current NLP monitoring applications, progenitors of hate speech often replace one or more letters in offensive words with homoglyphs, visually similar Unicode characters. Harvesting real-world hate speech containing homoglyphs is challenging due to the vast replacement possibilities. We developed a character substitution scraping method and assembled the Offensive Tweets with Homoglyphs (OTH) Dataset (N=90,788) with more than 1.5 million occurrences of 1,281 non-Latin characters (emojis excluded). In an annotated sample (n=700), 40.14% of the tweets were found to contain hate speech. We assessed the performance of seven transformer-based hate speech detection models and found that they performed poorly in a zero-shot setting (F1 scores between 0.04 and 0.52), but normalizing the data dramatically improved detection (F1 scores between 0.59 and 0.71). Training the models using the annotated data further boosted performance (highest micro-averaged F1 score=0.88, using five-fold cross-validation). This study indicates that a dataset containing homoglyphs known and unknown to the scraping script can be collected, and that neural models can be trained to recognize camouflaged real-world hate speech.

- Anthology ID: 2023.findings-emnlp.192
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 2922–2929
- URL: https://aclanthology.org/2023.findings-emnlp.192
- DOI: 10.18653/v1/2023.findings-emnlp.192
- Cite (ACL): Portia Cooper, Mihai Surdeanu, and Eduardo Blanco. 2023. Hiding in Plain Sight: Tweets with Hate Speech Masked by Homoglyphs. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2922–2929, Singapore. Association for Computational Linguistics.
- Cite (Informal): Hiding in Plain Sight: Tweets with Hate Speech Masked by Homoglyphs (Cooper et al., Findings 2023)
- PDF: https://preview.aclanthology.org/fix-volume-bibkeys/2023.findings-emnlp.192.pdf
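
The abstract reports that normalizing homoglyphs before classification substantially improved zero-shot detection. The idea can be illustrated with a minimal sketch, which is not the authors' implementation: the `HOMOGLYPHS` table and the `normalize_homoglyphs` function are hypothetical names, and the handful of mappings shown is only an assumption for illustration (the paper's dataset covers 1,281 non-Latin characters).

```python
import unicodedata

# Hypothetical, deliberately tiny lookup table (illustrative only;
# this is NOT the mapping used in the paper).
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0440": "p",  # Cyrillic small letter er
    "\u0441": "c",  # Cyrillic small letter es
    "\u0456": "i",  # Cyrillic small letter Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek small letter omicron
}


def normalize_homoglyphs(text: str) -> str:
    """Map visually similar characters back to plain Latin letters."""
    # NFKC folds compatibility forms (e.g. fullwidth or mathematical
    # letters) into their ordinary counterparts.
    folded = unicodedata.normalize("NFKC", text)
    # Characters without a table entry pass through unchanged.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in folded)


if __name__ == "__main__":
    # Two Cyrillic letters hidden inside an otherwise Latin word.
    print(normalize_homoglyphs("h\u0430t\u0435"))
```

A table-driven pass like this only recovers characters it already knows, which is one reason the abstract also reports results for models trained directly on the annotated, un-normalized tweets.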