Abstract
We present XARELLO, a reinforcement-learning-based generator of adversarial examples for testing the robustness of text classifiers. Our solution is adaptive: it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. This reflects the behaviour of persistent and experienced attackers, who are common in the misinformation-spreading environment. We evaluate our approach using several victim classifiers and credibility-assessment tasks, showing that it generates better-quality examples with fewer queries and is especially effective against modern LLMs. We also perform a qualitative analysis to understand the language patterns in the misinformation text that play a role in the attacks.
- Anthology ID:
- 2024.wassa-1.11
- Volume:
- Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Orphée De Clercq, Valentin Barriere, Jeremy Barnes, Roman Klinger, João Sedoc, Shabnam Tafreshi
- Venues:
- WASSA | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 125–140
- URL:
- https://aclanthology.org/2024.wassa-1.11
- Cite (ACL):
- Piotr Przybyła, Euan McGill, and Horacio Saggion. 2024. Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 125–140, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning (Przybyła et al., WASSA-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.wassa-1.11.pdf