Abstract
Automated filters are commonly used by online services to stop users from sending age-inappropriate or bullying messages, or from asking others to expose personal information. Previous work has focused on rules or classifiers to detect and filter offensive messages, but these are vulnerable to cleverly disguised plaintext and unseen expressions, especially in an adversarial setting where users can repeatedly try to bypass the filter. In this paper, we model disguised messages as if they were produced by encrypting the original message with an invented cipher. We apply automatic decipherment techniques to decode the disguised malicious text, which can then be filtered using rules or classifiers. We provide experimental results on three different datasets and show that decipherment is an effective tool for this task.
- Anthology ID:
- W18-5119
- Volume:
- Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
- Month:
- October
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, Jacqueline Wernimont
- Venue:
- ALW
- Publisher:
- Association for Computational Linguistics
- Pages:
- 149–159
- URL:
- https://aclanthology.org/W18-5119
- DOI:
- 10.18653/v1/W18-5119
- Cite (ACL):
- Zhelun Wu, Nishant Kambhatla, and Anoop Sarkar. 2018. Decipherment for Adversarial Offensive Language Detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 149–159, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Decipherment for Adversarial Offensive Language Detection (Wu et al., ALW 2018)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/W18-5119.pdf