Atoxia: Red-teaming Large Language Models with Target Toxic Answers

Yuhao Du, Zhuo Li, Pengyu Cheng, Xiang Wan, Anningzhe Gao


Abstract
Despite the substantial advancements in artificial intelligence, large language models (LLMs) remain challenged by generation-safety issues. With adversarial jailbreaking prompts, one can effortlessly induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the necessity for robust LLM red-teaming strategies to identify and mitigate such risks before large-scale application. To detect specific types of risks, we propose a novel red-teaming method that **A**ttacks LLMs with **T**arget **Toxi**c **A**nswers (**Atoxia**). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to examine the internal defects of a given LLM. The proposed attacker is trained with reinforcement learning, using the target LLM's output probability of the given harmful answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can successfully detect safety risks not only in open-source models but also in state-of-the-art black-box models such as GPT-4o.
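
The reward described in the abstract, namely the target LLM's output probability of a specified answer conditioned on the attacker-generated query, reduces to an ordinary conditional log-likelihood computation. Below is a minimal sketch of that scoring step, assuming a HuggingFace causal LM stands in for the target model; the function name `answer_likelihood_reward` and the demo strings are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the reward signal described in the abstract: the target
# LLM's (log-)probability of a specified answer, conditioned on the
# attacker-generated query. Names and demo strings are illustrative
# assumptions, not code from the paper; any HuggingFace causal LM can
# stand in for the target model under test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def answer_likelihood_reward(model, tokenizer, query: str, answer: str) -> float:
    """Mean log-probability the model assigns to `answer` given `query`."""
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([query_ids, answer_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Position i of the shifted logits predicts token i + 1 of the input.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    answer_start = query_ids.shape[1]
    token_log_probs = log_probs[0, answer_start - 1:, :].gather(
        -1, answer_ids[0].unsqueeze(-1)).squeeze(-1)

    # The mean log-likelihood of the answer tokens serves as the reward.
    return token_log_probs.mean().item()


if __name__ == "__main__":
    name = "gpt2"  # small public stand-in for the target LLM
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()
    reward = answer_likelihood_reward(
        lm, tok, "Question: What is the capital of France?\nAnswer:", " Paris.")
    print(f"reward (mean log-prob): {reward:.3f}")
```

In the paper's setting, this score would be computed for the attacker-generated query and misleading answer opening, then fed back as the reward in the reinforcement-learning update of the attacker; only the scoring step is sketched here.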
Anthology ID: 2025.findings-naacl.179
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3251–3266
URL: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.179/
Cite (ACL): Yuhao Du, Zhuo Li, Pengyu Cheng, Xiang Wan, and Anningzhe Gao. 2025. Atoxia: Red-teaming Large Language Models with Target Toxic Answers. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3251–3266, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Atoxia: Red-teaming Large Language Models with Target Toxic Answers (Du et al., Findings 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.179.pdf