Nullspace Disentanglement for Red Teaming Language Models

Yi Han, Yuanxing Liu, Weinan Zhang, Ting Liu


Abstract
With the widespread deployment of generative language models, concerns about their safety have continued to grow. High-quality fine-tuning data produced through red teaming plays a crucial role in model safety. Recently, automated red-teaming approaches have been proposed to create test cases. However, these approaches rely on open-ended generation and suffer from inefficiency and low attack success rates. In this work, we introduce a black-box approach that exploits the unique properties of the nullspace to disentangle and regulate the information within test cases that is crucial to attack success. Our study provides a new perspective for automated red-teaming research. Experimental results demonstrate that our approach outperforms baseline methods in attack success rate, and the generated test cases also excel in diversity and fluency.
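For readers unfamiliar with the term invoked in the abstract, the sketch below illustrates only the generic linear-algebra notion of a nullspace (an orthonormal nullspace basis obtained via SVD, and a projection onto it) using NumPy. It is not the authors' disentanglement method, which operates on language-model representations; the function names and the toy matrix are hypothetical.

```python
# Minimal sketch of the generic nullspace concept (not the paper's method).
import numpy as np

def nullspace_basis(A: np.ndarray, rtol: float = 1e-10) -> np.ndarray:
    """Return an orthonormal basis of null(A) as columns, via SVD."""
    _, s, vh = np.linalg.svd(A)
    rank = int(np.sum(s > rtol * s.max())) if s.size else 0
    # Right singular vectors beyond the rank span the nullspace.
    return vh[rank:].T

def project_onto_nullspace(A: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Project x onto null(A); A applied to the result is numerically zero."""
    N = nullspace_basis(A)
    return N @ (N.T @ x)

if __name__ == "__main__":
    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])   # rank 1, so null(A) has dimension 2
    x = np.array([1.0, 0.0, -1.0])
    x_null = project_onto_nullspace(A, x)
    print(A @ x_null)  # ~[0, 0]: the component "visible" to A is removed
```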
Anthology ID:
2025.emnlp-main.1083
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
21360–21376
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1083/
Cite (ACL):
Yi Han, Yuanxing Liu, Weinan Zhang, and Ting Liu. 2025. Nullspace Disentanglement for Red Teaming Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21360–21376, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Nullspace Disentanglement for Red Teaming Language Models (Han et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1083.pdf
Checklist:
2025.emnlp-main.1083.checklist.pdf