Changlong Li

2026

Although large language models (LLMs) undergo rigorous safety alignment, they remain vulnerable to adversarial attacks. Existing methods, particularly gradient-based prompt optimization, incur high computational costs and produce uninterpretable, high-perplexity inputs. Recent logit-space attacks improve efficiency but often rely on cumbersome auxiliary models or complex pipelines. In this work, we propose Sparse Index-Based Intervention (SIBI), a white-box, inference-time jailbreak that bypasses guardrails via lightweight, sparse logit editing. SIBI requires neither gradients nor auxiliary models: it modifies pre-softmax logits using a compact, tokenizer-aligned dictionary of penalty and reward tokens. By incorporating temperature-consistent scaling and a mixed-norm trust region, the method remains effective as an attack while preserving generation fluency. On standard benchmarks, SIBI achieves high attack success rates while reducing both computational and space overhead compared to optimization baselines.


State-of-the-art large language models (LLMs) have achieved impressive results on various tasks. However, these models are vulnerable to jailbreak attacks such as GCG and AutoDAN. Several defense strategies have been proposed to keep LLMs from generating harmful content, most of which rely on model fine-tuning or heuristic defense designs. These methods are often time-consuming or less effective. To fill this gap, this paper proposes a novel defense that combines online In-Context Learning (ICL) with an offline defensive suffix. Specifically, we first optimize the offline defensive suffix using an iterative algorithm. Second, an online stochastic random search identifies the most effective ICL demonstrations. Finally, the original user instruction, the selected ICL demonstrations, and the defensive suffix are assembled into a structured input prompt using a carefully designed template, which is then fed into the LLM for response generation. Experimental results show that our method is effective against both advanced white-box and black-box attacks, reducing the attack success rate to nearly 0% while maintaining the model's utility on benign tasks and incurring only negligible computational overhead. Our code is available at https://github.com/Trusted-LLM/DSICL.
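The online part of the pipeline described above (random search over ICL demonstrations, then prompt assembly) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: the chat-style template, the function names `build_prompt` and `random_search_demos`, and the caller-supplied `score` function, which stands in for whatever defense objective the authors actually use. The defensive suffix is treated as an already optimized string.

```python
import random
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]  # (instruction, safe response) pair used as an ICL demonstration


def build_prompt(instruction: str, demos: Sequence[Demo], suffix: str) -> str:
    """Assemble demonstrations, the user instruction, and the defensive suffix.

    The template is hypothetical; the paper's actual template is not given in the abstract.
    """
    parts = [f"User: {q}\nAssistant: {a}" for q, a in demos]
    parts.append(f"User: {instruction} {suffix}\nAssistant:")
    return "\n\n".join(parts)


def random_search_demos(
    pool: Sequence[Demo],
    instruction: str,
    suffix: str,
    score: Callable[[str], float],
    k: int = 2,
    iters: int = 20,
    seed: int = 0,
) -> Tuple[List[Demo], float]:
    """Sample `iters` random subsets of `k` demonstrations and keep the best-scoring one.

    `score` is a placeholder objective (higher is better), supplied by the caller.
    """
    rng = random.Random(seed)
    best: List[Demo] = []
    best_score = float("-inf")
    for _ in range(iters):
        candidate = rng.sample(list(pool), k)
        s = score(build_prompt(instruction, candidate, suffix))
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

In practice `score` would query the protected model, for example by measuring how strongly it prefers a refusal given the assembled prompt; that choice is left open here because the abstract does not specify it.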

Venues

Findings (2)
