Xinghao Yang
2026
Enhancing the Transferability of Jailbreak Attacks on Large Language Models via Exploiting Reparameterization Invariance
Ao Wang | Xinghao Yang | Yongshun Gong | Wei Liu | Bao-di Liu | Weifeng Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jailbreak attacks serve as a pivotal technique for evaluating the safety alignment of large language models. Current token-level attacks have shown remarkable efficacy on open-source models by leveraging gradient-based optimization. However, these attacks suffer from poor cross-model transferability, severely limiting their utility on proprietary models. To address this limitation, we propose Reparameterization Invariance Gradient-based Jailbreak (RIGJ), a natural-gradient-based framework designed to improve cross-model transferability. Unlike prior token-level methods, whose optimization paths are constrained by model-specific Euclidean geometry, RIGJ defines update directions according to differences in output distributions rather than parameter-space distances. Since language models are trained to capture similar dependency structures of natural language, their output distributions share common geometry across architectures, yielding intrinsically model-agnostic optimization trajectories and substantially stronger jailbreak transferability. Extensive experiments demonstrate superior performance, increasing the cross-model Attack Success Rate and Average Harmfulness Score by 14.9 and 1.23, respectively. Our code is available at https://github.com/nohuma/AISafety_transfer_jailbreak_RIGJ_2026.
2025
Disentangled Information Bottleneck for Adversarial Text Defense
Yidan Xu | Xinghao Yang | Wei Liu | Bao-di Liu | Weifeng Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Adversarial text defense is a significant strategy for protecting modern NLP models from attacks. Typical text defense methods enhance the model’s robustness by retraining the model or equipping it with a data preprocessing step, aiming to eliminate the non-robust features and preserve the robust ones. Although some efforts have been made to recognize the robust features, e.g., by the information bottleneck (IB) technique, fully disentangling the robust and non-robust representations remains a major challenge. To alleviate this problem, we propose a novel text defense method, named Disentangled Information Bottleneck (DisIB), with two major merits. First, we separate the robust features and non-robust features with a disentangled two-line framework rather than the one-line compression network in IB. This prevents the loss of robust features caused by information compression and produces complete robust features. Second, we design a discriminator network to approximate the minimum mutual information of the two lines, which sufficiently disentangles robust and non-robust features. To validate the effectiveness of our DisIB, we conduct a total of 96 defense experiments on four datasets, defending against four popular attack methods. Experimental results show that our method significantly outperforms six baselines, with accuracy improvements ranging from 3.8% to 20.7%.
2024
Adaptive Immune-based Sound-Shape Code Substitution for Adversarial Chinese Text Attacks
Ao Wang | Xinghao Yang | Chen Li | Bao-di Liu | Weifeng Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Adversarial textual examples reveal the vulnerability of natural language processing (NLP) models. Most existing text attack methods are designed for English text, while the robustness of models for the second most popular language, i.e., Chinese with 1 billion users, is greatly underexplored. Although several Chinese attack methods have been presented, they either transfer directly from English attacks or adopt a simple greedy search to optimize the attack priority, usually leading to unnatural sentences. To address these issues, we propose an adaptive Immune-based Sound-Shape Code (ISSC) algorithm for adversarial Chinese text attacks. First, we leverage the Sound-Shape code to generate natural substitutions, which comprehensively integrate multiple Chinese features. Second, we employ an adaptive immune algorithm (IA) to determine the replacement order, which reduces duplication within the population and thereby improves the search ability. Extensive experimental results validate the superiority of our ISSC in producing high-quality Chinese adversarial texts. Our code and data can be found at https://github.com/nohuma/chinese-attack-issc.