Decipherment for Adversarial Offensive Language Detection

Zhelun Wu, Nishant Kambhatla, Anoop Sarkar


Abstract
Automated filters are commonly used by online services to stop users from sending age-inappropriate or bullying messages, or from asking others to expose personal information. Previous work has focused on rules or classifiers to detect and filter offensive messages, but these are vulnerable to cleverly disguised plaintext and unseen expressions, especially in an adversarial setting where users can repeatedly try to bypass the filter. In this paper, we model the disguised messages as if they were produced by encrypting the original message using an invented cipher. We apply automatic decipherment techniques to decode the disguised malicious text, which can then be filtered using rules or classifiers. We provide experimental results on three different datasets and show that decipherment is an effective tool for this task.
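The decode-then-filter idea in the abstract can be illustrated with a toy sketch. This is not the authors' method, which the abstract does not specify. It assumes each disguised token is a one-to-one character substitution of some word in a known vocabulary, and it uses classical word-pattern matching to propose plaintext candidates. All names here (`pattern`, `decipher`, `is_blocked`, the vocabulary, and the blocklist) are hypothetical.

```python
def pattern(word):
    """Map each character to the index of its first occurrence.

    Example: 'hello' -> (0, 1, 2, 2, 3). Two strings related by a
    one-to-one character substitution share the same pattern.
    """
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in word)


def decipher(token, vocabulary):
    """Return the vocabulary word most plausibly hidden behind `token`.

    Candidates are words with the same substitution pattern. Ties are
    broken by how many characters the candidate shares in place with the
    token (an assumed heuristic: disguises tend to change few characters).
    """
    target = pattern(token)
    candidates = [w for w in vocabulary
                  if len(w) == len(token) and pattern(w) == target]
    if not candidates:
        return token  # nothing consistent found; leave the token unchanged
    return max(candidates, key=lambda w: sum(a == b for a, b in zip(token, w)))


def is_blocked(message, vocabulary, blocklist):
    """Decode every token, then apply a simple rule-based filter."""
    decoded = [decipher(t, vocabulary) for t in message.lower().split()]
    return any(word in blocklist for word in decoded)


# Hypothetical toy data, for illustration only.
vocabulary = ["stupid", "studio", "hello", "you", "are"]
blocklist = {"stupid"}

print(decipher("$tup1d", vocabulary))
print(is_blocked("you are $tup1d", vocabulary, blocklist))
print(is_blocked("hello studio", vocabulary, blocklist))
```

A real system would need a much larger vocabulary and a statistical model to rank candidates, since many words share the same pattern. The sketch only shows why decoding before filtering can catch messages that a direct blocklist lookup would miss.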
Anthology ID:
W18-5119
Volume:
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Month:
October
Year:
2018
Address:
Brussels, Belgium
Venue:
ALW
Publisher:
Association for Computational Linguistics
Pages:
149–159
URL:
https://aclanthology.org/W18-5119
DOI:
10.18653/v1/W18-5119
Cite (ACL):
Zhelun Wu, Nishant Kambhatla, and Anoop Sarkar. 2018. Decipherment for Adversarial Offensive Language Detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 149–159, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Decipherment for Adversarial Offensive Language Detection (Wu et al., ALW 2018)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/W18-5119.pdf