TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

Dominik Meier; Jan Philip Wahle; Paul Röttger; Terry Ruas; Bela Gipp

doi:10.18653/v1/2025.emnlp-main.1386

Abstract

As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information (“secrets”). We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the TrojanStego threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning that is learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, the compromised LLMs maintain high utility and coherence, and can evade human detection. Our results highlight a new type of LLM data exfiltration attack that is covert, practical, and dangerous.
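The abstract names two mechanisms, vocabulary partitioning and majority voting across three generations, without giving details. The following is a minimal sketch of the general idea only, not the authors' implementation. It assumes a hypothetical partition of token IDs by parity (even IDs carry bit 0, odd IDs carry bit 1) and assumes each generation step offers a ranked list of candidate token IDs; the function names and example values are illustrative.

```python
def encode_bits(bits, candidates_per_step):
    """Hide one bit per step by picking the first candidate token ID
    whose parity matches the bit (assumed partition: even = 0, odd = 1)."""
    chosen = []
    for bit, candidates in zip(bits, candidates_per_step):
        # Candidates are assumed to be ordered from most to least likely.
        token_id = next(t for t in candidates if t % 2 == bit)
        chosen.append(token_id)
    return chosen


def decode_bits(token_ids):
    """Recover the hidden bits by reading each token ID's partition."""
    return [t % 2 for t in token_ids]


def majority_vote(decodings):
    """Combine several decoded bit strings position by position."""
    return [int(sum(column) * 2 > len(column)) for column in zip(*decodings)]


# Hypothetical 4-bit secret and made-up candidate lists.
secret = [1, 0, 1, 1]
candidates = [[10, 7, 4], [3, 8, 5], [2, 2, 9], [6, 11, 1]]
tokens = encode_bits(secret, candidates)
recovered = decode_bits(tokens)

# Three noisy decodings, each with at most one flipped bit.
noisy = [[1, 0, 1, 1], [0, 0, 1, 1], [1, 0, 0, 1]]
voted = majority_vote(noisy)
```

In this toy setting a receiver only needs the partition rule to read the bits back, and voting over three outputs corrects errors that affect a given position in at most one output.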

Anthology ID: 2025.emnlp-main.1386
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 27232–27249
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.1386/
DOI: 10.18653/v1/2025.emnlp-main.1386
Cite (ACL): Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, and Bela Gipp. 2025. TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27232–27249, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent (Meier et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.1386.pdf
Checklist: 2025.emnlp-main.1386.checklist.pdf
