Wenjie Ruan


2026

Although Large Language Models undergo rigorous safety alignment, they remain vulnerable to adversarial attacks. Existing methods, particularly gradient-based prompt optimization, suffer from high computational costs and produce uninterpretable, high-perplexity inputs. While recent logit-space attacks improve efficiency, they often rely on cumbersome auxiliary models or complex pipelines. In this work, we propose Sparse Index-Based Intervention (SIBI), a white-box, inference-time jailbreak that bypasses guardrails via lightweight, sparse logit editing. SIBI operates without gradients or auxiliary models, modifying pre-softmax logits using a compact, tokenizer-aligned dictionary of penalty and reward tokens. By incorporating temperature-consistent scaling and a mixed-norm trust region, the method ensures attack effectiveness while preserving generation fluency. On standard benchmarks, SIBI achieves high attack success rates while reducing computational overhead and space overhead compared to optimization baselines.
State-of-the-art large language models (LLMs) have achieved impressive results on various tasks. However, these architectures are vulnerable to jailbreak attacks, such as GCG and AutoDAN. Several defense strategies have been proposed to protect LLMs from generating harmful content, with most methods focusing on model fine-tuning or heuristic defense designs. These methods are often time-consuming or less effective. To fill this gap, this paper proposes a novel defense solution by taking the advances of online In-Context Learning (ICL) and an offline defensive suffix. Specifically, we first optimize the offline defensive suffix using an iterative algorithm. Second, an online stochastic random search is conducted to identify the most effective ICL demonstrations. Finally, the original user instruction, the selected ICL demonstrations, and the defensive suffix are assembled into a structured input prompt using a carefully designed template, which is then fed into the LLM for response generation. Experimental results show that our method is effective against both advanced white-box and black-box attacks, reducing the attack success rate to nearly *0%*, while maintaining the model’s utility on the benign tasks and incurring only *negligible* computational overhead. Our code is available on https://github.com/Trusted-LLM/DSICL.

2024

Language models are vulnerable to clandestinely modified data and manipulation by attackers. Despite considerable research dedicated to enhancing robustness against adversarial attacks, the realm of provable robustness for backdoor attacks remains relatively unexplored. In this paper, we initiate a pioneering investigation into the certified robustness of NLP models against backdoor triggers.We propose a model-agnostic mechanism for large-scale models that applies to complex model structures without the need for assessing model architecture or internal knowledge. More importantly, we take recent advances in randomized smoothing theory and propose a novel weight-based distribution algorithm to enable semantic similarity and provide theoretical robustness guarantees.Experimentally, we demonstrate the efficacy of our approach across a diverse range of datasets and tasks, highlighting its utility in mitigating backdoor triggers. Our results show strong performance in terms of certified accuracy, scalability, and semantic preservation.

2023

When textual classifiers are deployed in safety-critical workflows, they must withstand the onslaught of AI-enabled model confusion caused by adversarial examples with minor alterations. In this paper, the main objective is to provide a formal verification framework, called TextVerifier, with certifiable guarantees on deep neural networks in natural language processing against word-level alteration attacks. We aim to provide an approximation of the maximal safe radius by deriving provable bounds both mathematically and automatically, where a minimum word-level L_0 distance is quantified as a guarantee for the classification invariance of victim models. Here, we illustrate three strengths of our strategy: i) certifiable guarantee: effective verification with convergence to ensure approximation of maximal safe radius with tight bounds ultimately; ii) high-efficiency: it yields an efficient speed edge by a novel parallelization strategy that can process a set of candidate texts simultaneously on GPUs; and iii) reliable anytime estimation: the verification can return intermediate bounds, and robustness estimates that are gradually, but strictly, improved as the computation proceeds. Furthermore, experiments are conducted on text classification on four datasets over three victim models to demonstrate the validity of tightening bounds. Our tool TextVerifier is available at https://github.com/TrustAI/TextVerifer.