Hongyan Chang
2025
Context-Aware Membership Inference Attacks against Pre-trained Large Language Models
Hongyan Chang | Ali Shahin Shamsabadi | Kleomenis Katevas | Hamed Haddadi | Reza Shokri
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs) aim to determine whether a data point was part of the model’s training set. Prior MIAs built for classification models fail on LLMs because they ignore the generative nature of LLMs over token sequences. In this paper, we present a novel attack on pre-trained LLMs that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
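A minimal sketch of the idea behind scoring membership from subsequence-level perplexity dynamics rather than a single sequence-level loss. This is not the paper's released code; the model name, window size, and the specific aggregation statistic are illustrative assumptions.

```python
# Sketch: membership score from perplexity dynamics of subsequences.
# Assumes a HuggingFace causal LM; "gpt2" and window=16 are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder pre-trained LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def token_log_likelihoods(text: str) -> torch.Tensor:
    """Per-token log-likelihoods of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Predict token t from tokens < t.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    return log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)[0]

def subsequence_perplexities(text: str, window: int = 16) -> torch.Tensor:
    """Perplexity of each sliding window of tokens (the 'dynamics')."""
    ll = token_log_likelihoods(text)
    windows = ll.unfold(0, min(window, len(ll)), 1)  # [num_windows, window]
    return torch.exp(-windows.mean(dim=-1))

def membership_score(text: str) -> float:
    """Toy statistic: members tend to show uniformly low perplexity across
    subsequences, so penalize both high mean and high variance."""
    ppl = subsequence_perplexities(text)
    return -(ppl.mean() + ppl.std()).item()
```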
Watermark Smoothing Attacks against Language Models
Hongyan Chang | Hamed Hassani | Reza Shokri
Findings of the Association for Computational Linguistics: EMNLP 2025
Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model’s confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from 1.3B to 30B parameters against 10 different watermarks, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
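A minimal sketch of the smoothing intuition described above: watermark bias is most detectable on low-confidence decoding steps, so those steps are smoothed with an unwatermarked reference model while confident steps are kept as-is. This is an illustrative assumption about the mechanism, not the paper's implementation; the model names, confidence threshold, and mixing weight are hypothetical.

```python
# Sketch: selectively smooth low-confidence steps with a reference model.
# Assumes both models share a tokenizer/vocabulary (e.g., GPT-2 family).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

wm_name, ref_name = "gpt2-large", "gpt2"   # illustrative model choices
tok = AutoTokenizer.from_pretrained(wm_name)
wm_model = AutoModelForCausalLM.from_pretrained(wm_name).eval()    # watermarked model
ref_model = AutoModelForCausalLM.from_pretrained(ref_name).eval()  # reference model

@torch.no_grad()
def smoothed_generate(prompt: str, max_new_tokens: int = 64,
                      conf_threshold: float = 0.9) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        wm_probs = torch.softmax(wm_model(ids).logits[:, -1], dim=-1)
        if wm_probs.max() >= conf_threshold:
            # Confident step: the watermark perturbs the output little, keep it.
            probs = wm_probs
        else:
            # Uncertain step: mix with the reference model to wash out
            # the watermark's token-level bias.
            ref_probs = torch.softmax(ref_model(ids).logits[:, -1], dim=-1)
            probs = 0.5 * wm_probs + 0.5 * ref_probs
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)
```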