Red-Teaming NSFW Image Classifiers as Text-to-Image Safeguards

Tinghao Xie; Yueqi Xie; Alireza Zareian; Shuming Hu; Felix Juefei-Xu; Xiaowen Lin; Ankit Jain; Prateek Mittal; Li Chen

Abstract

Not Safe for Work (NSFW) image classifiers play a critical role in safeguarding text-to-image (T2I) systems. However, a concerning phenomenon, dubbed "*context shifts*," has emerged in T2I systems: changes in text prompts that manipulate benign image elements can cause NSFW classifiers to miss NSFW content. For instance, while an NSFW image of "*a nude person in an empty scene*" can be easily blocked by most NSFW classifiers, a stealthier one that depicts "*a nude person blending in a group of dressed people*" may evade detection. We ask: how can we systematically reveal NSFW image classifiers' failures under such context shifts? To this end, we present an automated red-teaming framework that leverages a set of generative AI tools. We propose an **exploration-exploitation** approach: **First**, in the *exploration* stage, we synthesize a large and diverse dataset of 36K NSFW images that facilitates our study of context shifts. We find that varying fractions of the dataset (e.g., 4.1% to 36% for nude and sexual content) are misclassified by NSFW image classifiers like GPT-4o and Gemini. **Second**, in the *exploitation* stage, we leverage these failure cases to train a specialized LLM that rewrites unseen seed prompts into more evasive versions, increasing the likelihood of detection evasion by up to 6 times. Alarmingly, we show that **these failures translate to real-world T2I and even T2V systems** like DALL-E 3, Sora, Nano Banana, and Veo 3, going beyond the open-weight image generators in our main study. For example, querying DALL-E 3 with prompts rewritten by our approach increases the chance of obtaining NSFW images from 0% to over 50%.

Anthology ID: 2026.findings-acl.506
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10413–10441
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.506/
Cite (ACL): Tinghao Xie, Yueqi Xie, Alireza Zareian, Shuming Hu, Felix Juefei-Xu, Xiaowen Lin, Ankit Jain, Prateek Mittal, and Li Chen. 2026. Red-Teaming NSFW Image Classifiers as Text-to-Image Safeguards. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10413–10441, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Red-Teaming NSFW Image Classifiers as Text-to-Image Safeguards (Xie et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.506.pdf
Checklist: 2026.findings-acl.506.checklist.pdf
