Changlong Li
2026
Dictionary Guided Sparse Logit Editing for Reliable Jailbreak Attacks
Shuaibiao Han | Ruiyang Ni | Zhiyu Yi | Changlong Li | Perley Xu | Wenjie Ruan
Findings of the Association for Computational Linguistics: ACL 2026
Although large language models (LLMs) undergo rigorous safety alignment, they remain vulnerable to adversarial attacks. Existing methods, particularly gradient-based prompt optimization, are computationally expensive and produce uninterpretable, high-perplexity inputs. Recent logit-space attacks improve efficiency, but they often rely on cumbersome auxiliary models or complex pipelines. In this work, we propose Sparse Index-Based Intervention (SIBI), a white-box, inference-time jailbreak that bypasses guardrails via lightweight, sparse logit editing. SIBI requires neither gradients nor auxiliary models; it modifies pre-softmax logits using a compact, tokenizer-aligned dictionary of penalty and reward tokens. By incorporating temperature-consistent scaling and a mixed-norm trust region, the method remains effective as an attack while preserving generation fluency. On standard benchmarks, SIBI achieves high attack success rates while reducing both computational and space overhead compared to optimization baselines.
Defending LLMs against Jailbreak Attacks via Template-Based ICL with a Defensive Suffix
Ruiyang Ni | Changlong Li | Shuaibiao Han | Zhiyu Yi | Perley Xu | Wenjie Ruan
Findings of the Association for Computational Linguistics: ACL 2026
State-of-the-art large language models (LLMs) have achieved impressive results on various tasks. However, these models are vulnerable to jailbreak attacks, such as GCG and AutoDAN. Several defense strategies have been proposed to prevent LLMs from generating harmful content, with most methods focusing on model fine-tuning or heuristic defense designs. These methods are often time-consuming or less effective. To fill this gap, this paper proposes a novel defense that combines the advantages of online In-Context Learning (ICL) and an offline defensive suffix. Specifically, we first optimize the offline defensive suffix using an iterative algorithm. Second, an online stochastic random search is conducted to identify the most effective ICL demonstrations. Finally, the original user instruction, the selected ICL demonstrations, and the defensive suffix are assembled into a structured input prompt using a carefully designed template, which is then fed into the LLM for response generation. Experimental results show that our method is effective against both advanced white-box and black-box attacks, reducing the attack success rate to nearly 0%, while maintaining the model’s utility on benign tasks and incurring only negligible computational overhead. Our code is available at https://github.com/Trusted-LLM/DSICL.
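The three steps described in the abstract above (a fixed suffix, a random search over demonstrations, and template assembly) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the template layout, the function names `assemble_prompt` and `random_search_demos`, and the caller-supplied `score` function are all assumptions made for illustration, and the suffix is treated as an already-optimized string.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A demonstration is a (user instruction, safe response) pair.
Demo = Tuple[str, str]


def assemble_prompt(instruction: str, demos: Sequence[Demo], suffix: str) -> str:
    """Join demonstrations, the user instruction, and the defensive suffix.

    The "User:/Assistant:" layout is a hypothetical template, not the
    one used in the paper.
    """
    parts: List[str] = [f"User: {q}\nAssistant: {a}" for q, a in demos]
    parts.append(f"User: {instruction} {suffix}\nAssistant:")
    return "\n\n".join(parts)


def random_search_demos(
    pool: Sequence[Demo],
    k: int,
    score: Callable[[Sequence[Demo]], float],
    trials: int = 20,
    seed: int = 0,
) -> List[Demo]:
    """Sample `trials` random subsets of size k and keep the best-scoring one.

    `score` is a hypothetical, caller-supplied measure of how well a set
    of demonstrations defends the model (higher is better).
    """
    rng = random.Random(seed)
    best = rng.sample(list(pool), k)
    best_score = score(best)
    for _ in range(trials):
        candidate = rng.sample(list(pool), k)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

In use, the caller would run `random_search_demos` over a pool of safe demonstrations and pass the result, together with the user instruction and the suffix, to `assemble_prompt` before querying the model. How the demonstrations are scored is left open here because the abstract does not specify it.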