Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning

Piotr Przybyła, Euan McGill, Horacio Saggion


Abstract
We present XARELLO, a reinforcement-learning-based generator of adversarial examples for testing the robustness of text classifiers. Our solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. This reflects the behaviour of persistent and experienced attackers, who are common in the misinformation-spreading environment. We evaluate our approach using several victim classifiers and credibility-assessment tasks, showing that it generates better-quality examples with fewer queries and is especially effective against modern LLMs. We also perform a qualitative analysis to understand the language patterns in the misinformation text that play a role in the attacks.
Anthology ID:
2024.wassa-1.11
Volume:
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Orphée De Clercq, Valentin Barriere, Jeremy Barnes, Roman Klinger, João Sedoc, Shabnam Tafreshi
Venues:
WASSA | WS
Publisher:
Association for Computational Linguistics
Pages:
125–140
URL:
https://aclanthology.org/2024.wassa-1.11
Cite (ACL):
Piotr Przybyła, Euan McGill, and Horacio Saggion. 2024. Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 125–140, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning (Przybyła et al., WASSA-WS 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.wassa-1.11.pdf