B4: A Black-Box Scrubbing Attack on LLM Watermarks

Baizhou Huang, Xiao Pu, Xiaojun Wan


Abstract
Watermarking has emerged as a prominent technique for detecting LLM-generated content by embedding imperceptible patterns. Despite its strong performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, in which the specific type of watermark is already known. Some attacks even require knowledge of the watermarking method's hyperparameters. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we propose B4, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of B4 compared with other baselines.
Anthology ID:
2025.naacl-long.460
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
9113–9126
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.460/
Cite (ACL):
Baizhou Huang, Xiao Pu, and Xiaojun Wan. 2025. B4: A Black-Box Scrubbing Attack on LLM Watermarks. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 9113–9126, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
B4: A Black-Box Scrubbing Attack on LLM Watermarks (Huang et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.460.pdf