Watermark Smoothing Attacks against Language Models

Hongyan Chang, Hamed Hassani, Reza Shokri


Abstract
Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model’s confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from 1.3B to 30B parameters and on 10 different watermarking schemes, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
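The abstract describes the core idea at a high level: watermark signals are most detectable on tokens where the model is least confident, so smoothing those positions can erase the watermark while leaving confident (high-quality) tokens intact. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' actual algorithm; the function name `smoothing_attack_step`, the entropy-based confidence proxy, and the parameters `tau` and `alpha` are all assumptions introduced for illustration (see the linked PDF for the paper's method).

```python
import torch
import torch.nn.functional as F

def smoothing_attack_step(wm_logits, ref_logits, tau=2.0, alpha=0.5):
    """Hypothetical single-step sketch of a confidence-based smoothing attack.

    wm_logits:  next-token logits from the watermarked model (1D, vocab size)
    ref_logits: next-token logits from a reference, unwatermarked model
    tau:        entropy threshold separating low- from high-confidence tokens
                (assumed parameter, not from the paper)
    alpha:      interpolation weight for smoothing (assumed parameter)
    """
    wm_probs = F.softmax(wm_logits, dim=-1)
    ref_probs = F.softmax(ref_logits, dim=-1)

    # Confidence proxy: entropy of the reference distribution at this step.
    entropy = -(ref_probs * torch.log(ref_probs + 1e-9)).sum(dim=-1)

    if entropy > tau:
        # Low confidence: the watermark bias is most detectable here,
        # so smooth the watermarked distribution toward the reference.
        probs = alpha * wm_probs + (1 - alpha) * ref_probs
    else:
        # High confidence: keep the watermarked model's distribution;
        # the watermark signal is weak on these tokens anyway.
        probs = wm_probs

    # Sample the next token from the (possibly smoothed) distribution.
    return torch.multinomial(probs, num_samples=1)
```

Applied token by token during decoding, such selective smoothing would dilute the watermark only where it carries signal, which matches the abstract's claim of removing watermark traces while preserving text quality.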
Anthology ID:
2025.findings-emnlp.264
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4915–4941
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.264/
DOI:
10.18653/v1/2025.findings-emnlp.264
Cite (ACL):
Hongyan Chang, Hamed Hassani, and Reza Shokri. 2025. Watermark Smoothing Attacks against Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4915–4941, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Watermark Smoothing Attacks against Language Models (Chang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.264.pdf
Checklist:
2025.findings-emnlp.264.checklist.pdf