Measuring Watermarking under Jailbreaking: ASR Inflation and Goal-Compliance Mismatch

Sungwoo Han, Sangjun Moon, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura


Abstract
Watermarking has attracted growing attention as a practical technique for source attribution of machine-generated text. However, most prior work studies watermarking under benign prompts, and its behavior under jailbreaking prompts remains underexplored. This gap matters because jailbreaking can bypass safety policies and shift the generation regime, raising the concern that watermarking may interact with model alignment under attack. To address this gap, we evaluate six watermarking methods on four LLMs across two jailbreak benchmarks and three settings: Static, AutoDAN, and DSN. We find that watermarking can inflate the judge-based attack success rate (ASR) under jailbreaking, with the largest effects appearing in biased schemes that perturb logits. At the same time, these ASR increases often do not reflect higher harmful-goal compliance as measured by StrongREJECT or by human judgments. This suggests that ASR-only evaluations can be brittle to decoding perturbations and may overestimate harmful-goal compliance, motivating complementary goal-compliance metrics (e.g., StrongREJECT) and human evaluation.
Anthology ID:
2026.findings-acl.1797
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
36071–36083
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1797/
Cite (ACL):
Sungwoo Han, Sangjun Moon, Jingun Kwon, Hidetaka Kamigaito, and Manabu Okumura. 2026. Measuring Watermarking under Jailbreaking: ASR Inflation and Goal-Compliance Mismatch. In Findings of the Association for Computational Linguistics: ACL 2026, pages 36071–36083, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Measuring Watermarking under Jailbreaking: ASR Inflation and Goal-Compliance Mismatch (Han et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1797.pdf
Checklist:
 2026.findings-acl.1797.checklist.pdf