@inproceedings{han-etal-2025-nullspace,
    title = "Nullspace Disentanglement for Red Teaming Language Models",
    author = "Han, Yi  and
      Liu, Yuanxing  and
      Zhang, Weinan  and
      Liu, Ting",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1083/",
    pages = "21360--21376",
    ISBN = "979-8-89176-332-6",
    abstract = "With the widespread deployment of generative language models, concerns about safety issues have continuously grown. High-quality fine-tuning data generated from red teaming plays a crucial role in the model{'}s safety. Recently, automated red teaming approaches have been proposed to create test cases. However, these approaches, which rely on open-ended generation, encounter issues related to inefficiency and low attack success rates. In this work, we introduce a black-box approach that ingeniously exploits the unique properties of the nullspace to disentangle and regulate the crucial success information within test cases. Our study provides a brand-new perspective for automated red team research. Experimental results demonstrate that our approach outperforms baseline methods regarding the attack success rate. The generated test cases also excel in aspects of diversity and fluency."
}