SubmissionNumber#=%=#5 FinalPaperTitle#=%=#RedHit: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization ShortPaperTitle#=%=# NumberOfPages#=%=#10 CopyrightSigned#=%=#ALI DEHGHANTANHA JobTitle#==# Organization#==#University of Guelph, ON, Canada Abstract#==#Red-teaming has become a critical component of Large Language Model (LLM) security amid increasingly sophisticated adversarial techniques. However, existing methods often depend on hard-coded strategies that quickly become obsolete against novel attack patterns, requiring constant updates. Moreover, current automated red-teaming approaches typically lack effective reasoning capabilities, leading to lower attack success rates and longer training times. In this paper, we propose RedHit, a multi-round, automated, and adaptive red-teaming framework that integrates Monte Carlo Tree Search (MCTS), Chain-of-Thought (CoT) reasoning, and Direct Preference Optimization (DPO) to enhance the adversarial capabilities of an Adversarial LLM (ALLM). RedHit formulates prompt injection as a tree search problem, where the goal is to discover adversarial prompts capable of bypassing target model defenses. Each search step is guided by an Evaluator module that dynamically scores model responses using multi-detector feedback, yielding fine-grained reward signals. MCTS is employed to explore the space of adversarial prompts, incrementally constructing a Prompt Search Tree (PST) in which each node stores an adversarial prompt, its response, a reward, and other control properties. Prompts are generated via a locally hosted IndirectPromptGenerator module, which uses CoT-enabled prompt transformation to create multi-perspective, semantically equivalent variants for deeper tree exploration. CoT reasoning improves MCTS exploration by injecting strategic insights derived from past interactions, enabling RedHit to adapt dynamically to the target LLM's defenses.
Furthermore, DPO fine-tunes the ALLM using preference data collected from previous attack rounds, progressively enhancing its ability to generate more effective prompts. RedHit leverages the Garak framework to evaluate each adversarial prompt and compute rewards, demonstrating robust and adaptive adversarial behavior across multiple attack rounds. Author{1}{Firstname}#=%=#Mohsen Author{1}{Lastname}#=%=#Sorkhpour Author{1}{Email}#=%=#msorkhpo@uoguelph.ca Author{1}{Affiliation}#=%=#Cyber Science Lab, University of Guelph Author{2}{Firstname}#=%=#Abbas Author{2}{Lastname}#=%=#Yazdinejad Author{2}{Email}#=%=#ayazdine@uoguelph.ca Author{2}{Affiliation}#=%=#Cyber Science Lab, University of Guelph Author{3}{Firstname}#=%=#Ali Author{3}{Lastname}#=%=#Dehghantanha Author{3}{Username}#=%=#alidehghantanha Author{3}{Email}#=%=#adehghan@uoguelph.ca Author{3}{Affiliation}#=%=#University of Guelph ==========