Measuring Watermarking under Jailbreaking: ASR Inflation and Goal-Compliance Mismatch
Sungwoo Han, Sangjun Moon, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Abstract
Recently, watermarking has attracted growing attention as a practical technique for source attribution of machine-generated text. However, most prior work studies watermarking under benign prompts, while its behavior under jailbreaking prompts remains underexplored. This gap matters because jailbreaking can bypass safety policies and shift the generation regime, raising concerns that watermarking may interact with model alignment under attack. To address this gap, we evaluate six watermarking methods on four LLMs across two jailbreak benchmarks and three settings: Static, AutoDAN, and DSN. We find that watermarking can inflate judge-based attack success rate, denoted ASR, under jailbreaking, with the largest effects appearing in biased schemes that perturb logits. At the same time, these ASR increases often do not reflect higher harmful-goal compliance when measured by StrongREJECT or by human judgments. This suggests that ASR-only evaluations can be brittle to decoding perturbations and may overestimate harmful-goal compliance, motivating complementary goal-compliance metrics (e.g., StrongREJECT) and human evaluations.
- Anthology ID:
- 2026.findings-acl.1797
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 36071–36083
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1797/
- Cite (ACL):
- Sungwoo Han, Sangjun Moon, Jingun Kwon, Hidetaka Kamigaito, and Manabu Okumura. 2026. Measuring Watermarking under Jailbreaking: ASR Inflation and Goal-Compliance Mismatch. In Findings of the Association for Computational Linguistics: ACL 2026, pages 36071–36083, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Measuring Watermarking under Jailbreaking: ASR Inflation and Goal-Compliance Mismatch (Han et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1797.pdf