@inproceedings{sorkhpour-etal-2025-redhit,
title = "{R}ed{H}it: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization",
author = "Sorkhpour, Mohsen and
Yazdinejad, Abbas and
Dehghantanha, Ali",
editor = "Novikova, Jekaterina",
booktitle = "Proceedings of the The First Workshop on LLM Security (LLMSEC)",
month = aug,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/corrections-2025-08/2025.llmsec-1.2/",
pages = "7--16",
ISBN = "979-8-89176-279-4",
abstract = "Red-teaming has become a critical component of Large Language Models (LLMs) security amid increasingly sophisticated adversarial techniques. However, existing methods often depend on hard-coded strategies that quickly become obsolete against novel attack patterns, requiring constant updates.Moreover, current automated red-teaming approaches typically lack effective reasoning ca- pabilities, leading to lower attack success rates and longer training times. In this paper, we propose RedHit, a multi-round, automated, and adaptive red-teaming framework that integrates Monte Carlo Tree Search (MCTS), Chain-of-Thought (CoT) reasoning, and Direct Preference Optimization (DPO) to enhance the adversarial capabilities of an Adversarial LLM (ALLM). RedHit formulates prompt injection as a tree search problem, where the goal is to discover adversarial prompts capable of bypassing target model defenses. Each search step is guided by an Evaluator module that dynamically scores model responses using multi-detector feedback, yielding fine-grained reward signals. MCTS is employed to explore the space of adversarial prompts, incrementally constructing a Prompt Search Tree (PST) in which each node stores an adversarial prompt, its response, a reward, and other control properties. Prompts are generated via a locally hosted IndirectPromptGenerator module, which uses CoT-enabled prompt transformation to create multi-perspective, semantically equivalent variants for deeper tree exploration. CoT reasoning improves MCTS exploration by injecting strategic insights derived from past interactions, enabling RedHit to adapt dynamically to the target LLM{'}s defenses. Furthermore, DPO fine-tunes ALLM using preference data collected from previous attack rounds, progressively enhancing its ability to generate more effective prompts. Red-Hit leverages the Garak framework to evaluate each adversarial prompt and compute rewards,demonstrating robust and adaptive adversarial behavior across multiple attack rounds."
}
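
The abstract describes an MCTS loop over a Prompt Search Tree whose nodes hold a prompt, the target's response, a reward, and control statistics. The following is a minimal, hedged sketch of what such a loop could look like; all names here (PSTNode, select_ucb, mcts_round, the target_llm/generate_variants/evaluate callables) are hypothetical illustrations and not the authors' actual code or API, and the evaluator stand-in abstracts away the Garak-based multi-detector scoring the paper uses.

```python
# Hypothetical sketch of a Prompt Search Tree and an MCTS-style attack round,
# loosely following the description in the RedHit abstract. Names and structure
# are assumptions for illustration, not the authors' implementation.
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PSTNode:
    """A Prompt Search Tree node: adversarial prompt, response, reward, search stats."""
    prompt: str
    response: Optional[str] = None
    reward: float = 0.0
    visits: int = 0
    value: float = 0.0            # accumulated reward used for UCB-style selection
    parent: Optional["PSTNode"] = None
    children: List["PSTNode"] = field(default_factory=list)


def select_ucb(node: PSTNode, c: float = 1.4) -> PSTNode:
    """Descend the tree, at each level picking the child with the highest UCB1 score."""
    while node.children:
        node = max(
            node.children,
            key=lambda ch: (ch.value / (ch.visits + 1e-9))
            + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)),
        )
    return node


def expand(node: PSTNode, generate_variants: Callable[[str], List[str]]) -> PSTNode:
    """Expand a leaf with CoT-transformed, semantically equivalent prompt variants."""
    for variant in generate_variants(node.prompt):
        node.children.append(PSTNode(prompt=variant, parent=node))
    return random.choice(node.children) if node.children else node


def backpropagate(node: Optional[PSTNode], reward: float) -> None:
    """Propagate the evaluator's reward from the expanded node back to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def mcts_round(
    root: PSTNode,
    target_llm: Callable[[str], str],
    generate_variants: Callable[[str], List[str]],
    evaluate: Callable[[str, str], float],
    iterations: int = 50,
) -> PSTNode:
    """One attack round: grow the Prompt Search Tree guided by evaluator rewards."""
    for _ in range(iterations):
        leaf = select_ucb(root)
        child = expand(leaf, generate_variants)
        child.response = target_llm(child.prompt)               # query the target model
        child.reward = evaluate(child.prompt, child.response)   # multi-detector score
        backpropagate(child, child.reward)
    return root
```

One plausible way (an assumption, not stated in the abstract) to feed the DPO stage from such a tree is to pair sibling prompts with high versus low evaluator rewards as chosen/rejected preference examples for fine-tuning the adversarial LLM between rounds.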