Xiaowen Lin
2026
Red-Teaming NSFW Image Classifiers as Text-to-Image Safeguards
Tinghao Xie | Yueqi Xie | Alireza Zareian | Shuming Hu | Felix Juefei-Xu | Xiaowen Lin | Ankit Jain | Prateek Mittal | Li Chen
Findings of the Association for Computational Linguistics: ACL 2026
Not Safe for Work (NSFW) image classifiers play a critical role in safeguarding text-to-image (T2I) systems. However, a concerning phenomenon has emerged in T2I systems – changes in text prompts that manipulate benign image elements can result in failed detection by NSFW classifiers – dubbed "*context shifts*." For instance, while an NSFW image of "*a nude person in an empty scene*" can be easily blocked by most NSFW classifiers, a stealthier one that depicts "*a nude person blending in a group of dressed people*" may evade detection. We ask: how to systematically reveal NSFW image classifiers’ failure against such context shifts? Towards this end, we present an automated red-teaming framework that leverages a set of generative AI tools. We propose an **exploration-exploitation** approach: **First**, in the *exploration* stage, we synthesize a diverse and massive 36K NSFW image dataset that facilitates our study of context shifts. We find that varying fractions (e.g., 4.1% to 36% nude and sexual content) of the dataset are misclassified by NSFW image classifiers like GPT-4o and Gemini. **Second**, in the *exploitation* stage, we leverage these failure cases to train a specialized LLM that rewrites unseen seed prompts into more evasive versions, increasing the likelihood of detection evasion by up to 6 times. Alarmingly, we show **these failures translate to real-world T2I and even T2V systems** like DALL-E 3, Sora, Nano Banana, and Veo 3 – beyond the open-weight image generators in our main study. For example, querying DALL-E 3 with prompts rewritten by our approach increases the chance of obtaining NSFW images from 0 to over 50%.
2019
Answering Complex Open-domain Questions Through Iterative Query Generation
Peng Qi | Xiaowen Lin | Leo Mehr | Zijian Wang | Christopher D. Manning
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
It is challenging for current one-step retrieve-and-read question answering (QA) systems to answer questions like “Which novel by the author of ‘Armada’ will be adapted as a feature film by Steven Spielberg?” because the question seldom contains retrievable clues about the missing entity (here, the author). Answering such a question requires multi-hop reasoning where one must gather information about the missing entity (or facts) to proceed with further reasoning. We present GoldEn (Gold Entity) Retriever, which iterates between reading context and retrieving more supporting documents to answer open-domain multi-hop questions. Instead of using opaque and computationally expensive neural retrieval models, GoldEn Retriever generates natural language search queries given the question and available context, and leverages off-the-shelf information retrieval systems to query for missing entities. This allows GoldEn Retriever to scale up efficiently for open-domain multi-hop reasoning while maintaining interpretability. We evaluate GoldEn Retriever on the recently proposed open-domain multi-hop QA dataset, HotpotQA, and demonstrate that it outperforms the best previously published model despite not using pretrained language models such as BERT.