Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models

Shahnewaz Karim Sakib, Anindya Bijoy Das, Shibbir Ahmed


Abstract
Adversarial factuality refers to the deliberate insertion of misinformation into input prompts by an adversary, characterized by varying levels of expressed confidence. In this study, we systematically evaluate the performance of several open-source large language models (LLMs) when exposed to such adversarial inputs. Three tiers of adversarial confidence are considered: strongly confident, moderately confident, and limited confidence. Our analysis encompasses eight LLMs: LLaMA 3.1 (8B), Phi 3 (3.8B), Qwen 2.5 (7B), Deepseek-v2 (16B), Gemma2 (9B), Falcon (7B), Mistrallite (7B), and LLaVA (7B). Empirical results indicate that LLaMA 3.1 (8B) exhibits a robust capability in detecting adversarial inputs, whereas Falcon (7B) shows comparatively lower performance. Notably, for the majority of the models, detection success improves as the adversary’s confidence decreases; however, this trend is reversed for LLaMA 3.1 (8B) and Phi 3 (3.8B), where a reduction in adversarial confidence corresponds with diminished detection performance. Further analysis of the queries that elicited the highest and lowest rates of successful attacks reveals that adversarial attacks are more effective when targeting less commonly referenced or obscure information.
Anthology ID:
2025.trustnlp-main.28
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galystan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
432–443
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.28/
DOI:
Bibkey:
Cite (ACL):
Shahnewaz Karim Sakib, Anindya Bijoy Das, and Shibbir Ahmed. 2025. Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 432–443, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models (Sakib et al., TrustNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.28.pdf