Fattane Zarrinkalam

2026

Defense Against Knowledge Poisoning Attack on GraphRAG
Havva Alizadeh Noughabi | Fattane Zarrinkalam | Ali Dehghantanha
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

GraphRAG augments large language models with structured knowledge graphs, enabling graph-based context selection and a more integrated view of the knowledge space. However, recent work shows that GraphRAG exposes a new attack surface: corpus-level knowledge poisoning can inject spurious entities and relationships during graph construction, corrupting query-specific subgraphs and steering the generator toward incorrect answers. We propose Hop-wise Guard for GraphRAG (HoG-GRAG), a defense layer between retriever and generator that decomposes multi-hop questions into ordered subqueries, monitors hop-wise execution for poisoning-induced inconsistencies, and locally repairs the retrieved subgraph by pruning compromised entities and relationships and adding only minimal missing evidence. Experiments on multi-hop datasets and multiple GraphRAG configurations show that HoG-GRAG recovers a large fraction of the lost performance. The code is available at https://github.com/CyberScienceLab/HoG-GRAG.
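As a rough illustration of the hop-wise idea in the abstract above, the following toy sketch walks an ordered list of subquery relations over a small edge list and prunes competing edges at each hop. It is not the HoG-GRAG implementation: the edge format, the `guard_hops` name, the "competing tails for one hop" inconsistency signal, and the support-count tie-break are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical edge format: (head, relation, tail, support count).
Edge = Tuple[str, str, str, int]


def guard_hops(
    edges: List[Edge], start: str, hops: List[str]
) -> Tuple[Optional[str], List[Edge], List[Edge]]:
    """Follow `hops` from `start`, pruning conflicting edges at each hop.

    Assumption: each hop expects a single tail entity, so several
    candidate tails for the same (head, relation) count as an
    inconsistency, resolved by keeping the best-supported edge.
    """
    kept = list(edges)
    pruned: List[Edge] = []
    current = start
    for relation in hops:
        candidates = [e for e in kept if e[0] == current and e[1] == relation]
        if not candidates:
            # Missing evidence for this hop; a real system would try to
            # retrieve it. The sketch simply stops.
            return None, kept, pruned
        best = max(candidates, key=lambda e: e[3])
        for e in candidates:
            if e is not best:
                kept.remove(e)
                pruned.append(e)
        current = best[2]
    return current, kept, pruned
```

Called with a made-up graph in which one `directed_by` edge has less support than its competitor, the function returns the final-hop entity together with the repaired and pruned edge lists. A support count is a weak signal on its own, since an attacker can inject repeated evidence, so this only shows the control flow of a hop-by-hop check.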

2025


RedHit: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization
Mohsen Sorkhpour | Abbas Yazdinejad | Fattane Zarrinkalam | Ali Dehghantanha
Proceedings of the First Workshop on LLM Security (LLMSEC)

Red-teaming has become a critical component of Large Language Model (LLM) security amid increasingly sophisticated adversarial techniques. However, existing methods often depend on hard-coded strategies that quickly become obsolete against novel attack patterns, requiring constant updates. Moreover, current automated red-teaming approaches typically lack effective reasoning capabilities, leading to lower attack success rates and longer training times. In this paper, we propose RedHit, a multi-round, automated, and adaptive red-teaming framework that integrates Monte Carlo Tree Search (MCTS), Chain-of-Thought (CoT) reasoning, and Direct Preference Optimization (DPO) to enhance the adversarial capabilities of an Adversarial LLM (ALLM). RedHit formulates prompt injection as a tree search problem, where the goal is to discover adversarial prompts capable of bypassing target model defenses. Each search step is guided by an Evaluator module that dynamically scores model responses using multi-detector feedback, yielding fine-grained reward signals. MCTS is employed to explore the space of adversarial prompts, incrementally constructing a Prompt Search Tree (PST) in which each node stores an adversarial prompt, its response, a reward, and other control properties. Prompts are generated via a locally hosted IndirectPromptGenerator module, which uses CoT-enabled prompt transformation to create multi-perspective, semantically equivalent variants for deeper tree exploration. CoT reasoning improves MCTS exploration by injecting strategic insights derived from past interactions, enabling RedHit to adapt dynamically to the target LLM’s defenses. Furthermore, DPO fine-tunes the ALLM using preference data collected from previous attack rounds, progressively enhancing its ability to generate more effective prompts. RedHit leverages the Garak framework to evaluate each adversarial prompt and compute rewards, demonstrating robust and adaptive adversarial behavior across multiple attack rounds.
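The tree structure described in this abstract can be pictured with a generic MCTS skeleton. The sketch below is a minimal illustration, not RedHit's code: the `PSTNode` fields beyond prompt, response, and reward, the UCB1 selection rule, and the exploration constant are assumptions, and prompt generation and evaluator scoring are left out entirely.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class PSTNode:
    """One node of a prompt search tree (field names are hypothetical)."""
    prompt: str
    response: str = ""
    reward: float = 0.0          # most recent evaluator score
    visits: int = 0              # control property: visit count
    total_reward: float = 0.0    # control property: accumulated reward
    parent: Optional["PSTNode"] = None
    children: List["PSTNode"] = field(default_factory=list)

    def add_child(self, prompt: str) -> "PSTNode":
        child = PSTNode(prompt=prompt, parent=self)
        self.children.append(child)
        return child


def ucb1(node: PSTNode, c: float = 1.4) -> float:
    """Standard UCB1 score; unvisited nodes are always tried first."""
    if node.visits == 0:
        return math.inf
    mean = node.total_reward / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def select(root: PSTNode) -> PSTNode:
    """Descend from the root, picking the highest-UCB1 child until a leaf."""
    node = root
    while node.children:
        node = max(node.children, key=ucb1)
    return node


def backpropagate(node: PSTNode, reward: float) -> None:
    """Record the reward on the node and propagate statistics to the root."""
    node.reward = reward
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

In a full loop, each iteration would call `select`, expand the chosen leaf with new variants, score the target model's response, and call `backpropagate` with that score. Those steps depend on components the abstract names but does not specify, so they are omitted here.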
