Shanqing Guo
2026
Beyond the Safety Tax: Mitigating Unsafe Text-to-Image Generation via External Safety Rectification
Xiangtao Meng | Yingkai Dong | Ning Yu | Li Wang | Zheng Li | Shanqing Guo
Findings of the Association for Computational Linguistics: ACL 2026
Text-to-image (T2I) generative models have achieved remarkable visual fidelity, yet remain vulnerable to generating unsafe content. Existing safety defenses typically intervene internally within the generative model, but suffer from severe concept entanglement, leading to degradation of benign generation quality, a trade-off we term the Safety Tax. To overcome this limitation, we advocate a paradigm shift from destructive internal editing to external safety rectification. Following this principle, we propose SafePatch, a structurally isolated safety module that performs external, interpretable rectification without modifying the base model. The core backbone of SafePatch is architecturally instantiated as a trainable clone of the base model’s encoder, allowing it to inherit rich semantic priors and maintain representation consistency. To enable interpretable safety rectification, we construct a strictly aligned counterfactual safety dataset (ACS) for differential supervision training. Across nudity and multi-category benchmarks and recent adversarial prompt attacks, SafePatch achieves robust unsafe suppression (7% unsafe on I2P) while preserving image quality and semantic alignment.
2025
DROWN: Towards Tighter LiRPA-based Robustness Certification
Yunruo Zhang | Tianyu Du | Shouling Ji | Shanqing Guo
Proceedings of the 31st International Conference on Computational Linguistics
The susceptibility of deep neural networks to adversarial attacks is a well-established concern. Robustness certification has been proposed to address this problem, but existing methods suffer from precision or scalability issues. In this paper, we present DROWN (Dual CROWN), a novel method for certifying the robustness of DNNs. DROWN tightens classic LiRPA-based methods while maintaining similar scalability. It does so by refining the pre-activation bounds of ReLU relaxations using two pairs of linear bounds derived from different relaxations of ReLU units in previous layers. Extensive evaluations show that DROWN achieves up to 83.39% higher certified robust accuracy than the baseline on CNNs and up to 4.68 times larger certified radii than the baseline on Transformers. Meanwhile, the running time of DROWN is about twice that of the baseline.