“That Is a Suspicious Reaction!”: Interpreting Logits Variation to Detect NLP Adversarial Attacks

Edoardo Mosca; Shreyash Agarwal; Javier Rando Ramírez; Georg Groh

doi:10.18653/v1/2022.acl-long.538

Abstract

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
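
To make the idea of reading the logits while perturbing the input more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that perturbing means replacing one word at a time with a placeholder token, and that the reaction is measured as the margin between the originally predicted class and the strongest competing class. The names `logit_variation` and `toy_logits`, the `[UNK]` placeholder, and the toy word weights are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

def logit_variation(
    tokens: Sequence[str],
    logits_fn: Callable[[Sequence[str]], List[float]],
    mask_token: str = "[UNK]",  # assumed placeholder, not taken from the abstract
) -> Tuple[int, float, List[float]]:
    """Return the predicted class, its margin on the unperturbed input,
    and the margin obtained after masking each word in turn."""
    base = logits_fn(tokens)
    pred = max(range(len(base)), key=base.__getitem__)

    def margin(logits: List[float]) -> float:
        # Logit of the original prediction minus the best competing logit.
        best_other = max(l for i, l in enumerate(logits) if i != pred)
        return logits[pred] - best_other

    per_word = []
    for i in range(len(tokens)):
        perturbed = list(tokens[:i]) + [mask_token] + list(tokens[i + 1:])
        per_word.append(margin(logits_fn(perturbed)))
    return pred, margin(base), per_word

# Hypothetical two-class bag-of-words classifier standing in for the target model.
WEIGHTS = {"great": (0.0, 2.0), "dreadful": (3.0, 0.0)}

def toy_logits(tokens: Sequence[str]) -> List[float]:
    neg = sum(WEIGHTS.get(t, (0.0, 0.0))[0] for t in tokens)
    pos = sum(WEIGHTS.get(t, (0.0, 0.0))[1] for t in tokens)
    return [neg, pos]

pred, base_margin, per_word = logit_variation(["great", "great", "dreadful"], toy_logits)
print(pred, base_margin, per_word)
```

In this toy setup, a negative per-word margin means that masking that single word flips the prediction. A detector along the lines the abstract describes would learn from such patterns; how the resulting values are aggregated and classified is not specified in the abstract and is left out of this sketch.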

Anthology ID: 2022.acl-long.538
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7806–7816
URL: https://aclanthology.org/2022.acl-long.538
DOI: 10.18653/v1/2022.acl-long.538
Cite (ACL): Edoardo Mosca, Shreyash Agarwal, Javier Rando Ramírez, and Georg Groh. 2022. “That Is a Suspicious Reaction!”: Interpreting Logits Variation to Detect NLP Adversarial Attacks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7806–7816, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): “That Is a Suspicious Reaction!”: Interpreting Logits Variation to Detect NLP Adversarial Attacks (Mosca et al., ACL 2022)
PDF: https://preview.aclanthology.org/paclic-22-ingestion/2022.acl-long.538.pdf
Software: 2022.acl-long.538.software.zip
Video: https://preview.aclanthology.org/paclic-22-ingestion/2022.acl-long.538.mp4
Data: AG News, IMDb Movie Reviews
