Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

San Kim, Jonghwi Kim, Yejin Jeon, Gary Lee


Abstract
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk; attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.
Anthology ID:
2025.findings-acl.1263
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
24597–24614
Language:
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1263/
DOI:
Bibkey:
Cite (ACL):
San Kim, Jonghwi Kim, Yejin Jeon, and Gary Lee. 2025. Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24597–24614, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection (Kim et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1263.pdf