From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks

Steffen Eger, Yannik Benz


Abstract
Adversarial attacks are label-preserving modifications to inputs of machine learning classifiers designed to fool machines but not humans. Natural Language Processing (NLP) has mostly focused on high-level attack scenarios such as paraphrasing input texts. We argue that these are less realistic in typical application scenarios such as in social media, and instead focus on low-level attacks on the character-level. Guided by human cognitive abilities and human robustness, we propose the first large-scale catalogue and benchmark of low-level adversarial attacks, which we dub Zéroe, encompassing nine different attack modes including visual and phonetic adversaries. We show that RoBERTa, NLP’s current workhorse, fails on our attacks. Our dataset provides a benchmark for testing robustness of future more human-like NLP models.
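The nine attack modes themselves are implemented in the repository linked under Code below. As a rough illustration only, the sketch below applies two generic character-level perturbations in the spirit of the benchmark (adjacent-character swaps and visual homoglyph substitution); the function names, the homoglyph table, and the perturbation probability are assumptions for illustration, not the authors' implementation.

    import random

    # Hypothetical homoglyph table for a "visual" perturbation; the paper's own
    # mappings (and the remaining attack modes) live in the yannikbenz/zeroe repo.
    HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "c": "с", "p": "р"}  # Cyrillic look-alikes

    def swap_attack(word, rng):
        # Swap two adjacent inner characters, a classic low-level perturbation.
        if len(word) < 4:
            return word
        i = rng.randint(1, len(word) - 3)
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def visual_attack(word, rng):
        # Replace one character with a visually similar (homoglyph) character.
        candidates = [i for i, ch in enumerate(word) if ch in HOMOGLYPHS]
        if not candidates:
            return word
        i = rng.choice(candidates)
        return word[:i] + HOMOGLYPHS[word[i]] + word[i + 1:]

    def perturb(text, p=0.3, seed=0):
        # Perturb each whitespace token with probability p, picking a random mode.
        rng = random.Random(seed)
        modes = [swap_attack, visual_attack]
        return " ".join(
            rng.choice(modes)(tok, rng) if rng.random() < p else tok
            for tok in text.split()
        )

    print(perturb("adversarial attacks fool machines but not humans"))

A sentence perturbed this way typically remains readable to humans while changing the token strings seen by a subword tokenizer, which is the kind of robustness gap the benchmark probes in models such as RoBERTa.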
Anthology ID:
2020.aacl-main.79
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
786–803
URL:
https://aclanthology.org/2020.aacl-main.79
Cite (ACL):
Steffen Eger and Yannik Benz. 2020. From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 786–803, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks (Eger & Benz, AACL 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.aacl-main.79.pdf
Code
 yannikbenz/zeroe
Data
Jigsaw Toxic Comment Classification Dataset
SNLI