Beyond the Safety Tax: Mitigating Unsafe Text-to-Image Generation via External Safety Rectification

Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo


Abstract
Text-to-image (T2I) generative models have achieved remarkable visual fidelity, yet remain vulnerable to generating unsafe content. Existing safety defenses typically intervene internally within the generative model, but suffer from severe concept entanglement, which degrades benign generation quality, a trade-off we term the Safety Tax. To overcome this limitation, we advocate a paradigm shift from destructive internal editing to external safety rectification. Following this principle, we propose SafePatch, a structurally isolated safety module that performs external, interpretable rectification without modifying the base model. The backbone of SafePatch is instantiated as a trainable clone of the base model’s encoder, allowing it to inherit rich semantic priors and maintain representation consistency. To enable interpretable safety rectification, we construct a strictly aligned counterfactual safety (ACS) dataset for differential supervision training. Across nudity and multi-category benchmarks, as well as recent adversarial prompt attacks, SafePatch achieves robust unsafe suppression (7% unsafe on I2P) while preserving image quality and semantic alignment.
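
The external-rectification idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `ToyEncoder`, `ExternalPatch`, `rectified_forward`, and `train_gate` are hypothetical, the "encoder" is a single linear map, the zero-initialized scalar gate is an assumption borrowed from common side-network designs, and the "differential" target (base output on the safe input minus base output on the unsafe input) is only one plausible reading of differential supervision on aligned counterfactual pairs. For brevity only the gate is fitted here, whereas the abstract describes the cloned encoder itself as trainable.

```python
import copy


class ToyEncoder:
    """Stand-in for the base model's encoder: one linear layer y = W x."""

    def __init__(self, weights):
        self.weights = [row[:] for row in weights]

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


class ExternalPatch:
    """Hypothetical external module: a clone of the encoder plus a scalar gate."""

    def __init__(self, base):
        self.clone = copy.deepcopy(base)  # inherits base weights; separate object
        self.gate = 0.0                   # assumption: zero-init, so a no-op at start

    def correction(self, x):
        return [self.gate * v for v in self.clone(x)]


def rectified_forward(base, patch, x):
    """Base output plus an additive correction; base weights are never written."""
    return [b + c for b, c in zip(base(x), patch.correction(x))]


def train_gate(base, patch, pairs, lr=0.1, steps=200):
    """Fit only the gate so the correction matches base(safe) - base(unsafe)."""
    for _ in range(steps):
        for unsafe, safe in pairs:
            target = [s - u for s, u in zip(base(safe), base(unsafe))]
            feats = patch.clone(unsafe)
            # Gradient of 0.5 * ||gate * feats - target||^2 with respect to gate.
            grad = sum((patch.gate * f - t) * f for f, t in zip(feats, target))
            patch.gate -= lr * grad
```

In this sketch the base encoder is only ever read, so removing the patch restores the original behavior exactly, which is the property the abstract contrasts with internal editing.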
Anthology ID:
2026.findings-acl.1394
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
27986–27998
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1394/
Cite (ACL):
Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, and Shanqing Guo. 2026. Beyond the Safety Tax: Mitigating Unsafe Text-to-Image Generation via External Safety Rectification. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27986–27998, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Beyond the Safety Tax: Mitigating Unsafe Text-to-Image Generation via External Safety Rectification (Meng et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1394.pdf
Checklist:
 2026.findings-acl.1394.checklist.pdf