Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models

Ruiyi Yan, Yugo Murawaki


Abstract
Large language models have significantly enhanced the capability and efficiency of text generation. On the one hand, they have improved the quality of text-based *steganography*. On the other hand, they have also underscored the importance of *watermarking* as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: **infrequency** and **temporariness**. Based on these findings, we propose two tailored solutions for TI elimination: *a stepwise verification* method for steganography and *a post-hoc rollback* method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.
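
To make the failure mode concrete, the following minimal sketch (our illustration, not the authors' implementation) shows how TI arises with an off-the-shelf subword tokenizer: Alice transmits detokenized text, and TI occurs when Bob's re-tokenization of that text fails to reproduce Alice's original token IDs. The GPT-2 tokenizer and the `"app"`/`"le"` split are assumptions chosen for illustration.

```python
# Minimal sketch (assumed example, not the paper's code) of tokenization
# inconsistency (TI): Alice sends detokenized text, and TI occurs when
# Bob's re-tokenization fails to reproduce Alice's token IDs.
from transformers import AutoTokenizer  # pip install transformers

tok = AutoTokenizer.from_pretrained("gpt2")  # any subword LLM tokenizer

def round_trips(ids):
    """True iff decoding then re-encoding reproduces the same token IDs."""
    return tok.encode(tok.decode(ids)) == ids

# Canonical tokenizations usually survive the round trip:
print(round_trips(tok.encode("The cat sat on the mat.")))  # True

# But generation can emit a non-canonical split, e.g. "apple" as the two
# subwords "app" + "le" (assuming both pieces exist in the vocabulary):
alice_ids = tok.convert_tokens_to_ids(["app", "le"])
surface_text = tok.convert_tokens_to_string(["app", "le"])  # "apple"
bob_ids = tok.encode(surface_text)  # Bob recovers the canonical split
print(bob_ids == alice_ids)  # typically False -> TI breaks extraction
```

Because any payload or watermark signal keyed to Alice's token sequence is lost once `bob_ids` diverges, both proposed methods aim to ensure that only round-trip-consistent tokens are emitted.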
Anthology ID:
2025.emnlp-main.361
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7087–7109
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.361/
DOI:
10.18653/v1/2025.emnlp-main.361
Cite (ACL):
Ruiyi Yan and Yugo Murawaki. 2025. Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7087–7109, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models (Yan & Murawaki, EMNLP 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.361.pdf
Checklist:
2025.emnlp-main.361.checklist.pdf