In this paper, we investigate the overlooked impact of instruction-tuning on memorization in large language models (LLMs); memorization has so far been studied largely in base, pre-trained models. We propose a black-box prompt optimization method in which an attacker LLM agent uncovers higher levels of memorization in a victim agent than traditional approaches that prompt the model directly with training data. Using an iterative rejection-sampling process, we design instruction-based prompts that minimize their own overlap with the training data, to avoid providing direct solutions, while maximizing the overlap between the victim’s output and the training data, to induce memorization. Our method shows 23.7% more overlap with training data than state-of-the-art baselines. We explore two attack settings: an analytical approach that determines the empirical upper bound of the attack, both with and without access to responses for prompt initialization, and a practical classifier-based method for assessing memorization without access to memorized data. Our findings reveal that instruction-tuned models can expose pre-training data as much as, or more than, base models; that contexts beyond the original training data can lead to leakage; and that instructions generated by other LLMs open new avenues for automated attacks, which we believe require further exploration.
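The rejection-sampling objective described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the word n-gram overlap metric, the weight `alpha`, and the `propose` and `victim` callables (stand-ins for the attacker and victim LLMs) are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple


def ngram_overlap(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of the reference's word n-grams that also appear in the candidate."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ref = ngrams(reference)
    if not ref:
        return 0.0
    return len(ref & ngrams(candidate)) / len(ref)


def score_prompt(prompt: str, victim_output: str, target: str, alpha: float = 1.0) -> float:
    """Reward overlap of the victim's output with the target; penalize prompt leakage."""
    return ngram_overlap(victim_output, target) - alpha * ngram_overlap(prompt, target)


def rejection_sample(
    propose: Callable[[str], List[str]],  # hypothetical attacker: prompt -> new candidate prompts
    victim: Callable[[str], str],         # hypothetical victim: prompt -> generated text
    target: str,
    seed_prompt: str,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep a candidate prompt only if it scores strictly higher than the current best."""
    best_prompt = seed_prompt
    best_score = score_prompt(seed_prompt, victim(seed_prompt), target)
    for _ in range(rounds):
        for candidate in propose(best_prompt):
            score = score_prompt(candidate, victim(candidate), target)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

In this sketch a prompt that simply quotes the target is penalized by the second term of `score_prompt`, so only prompts that elicit the target without containing it receive a high score.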
Adversarial attacks against language models (LMs) are a significant concern. In particular, adversarial samples exploit the model’s sensitivity to small input changes: although these changes appear to have little effect on the semantics of the input sample, they cause a significant drop in model performance. In this paper, we propose Targeted Paraphrasing via RL (TPRL), an approach that automatically learns a policy for generating challenging samples that improve the model’s performance. TPRL leverages FLAN-T5, a language model, as a generator and employs a self-learned policy, trained with proximal policy optimization, to generate adversarial examples automatically. TPRL’s reward is based on the confusion induced in the classifier, while a Mutual Implication score preserves the meaning of the original text. We demonstrate and evaluate TPRL’s effectiveness in discovering natural adversarial attacks and improving model performance through extensive experiments on four diverse NLP classification tasks, using both automatic and human evaluation. TPRL outperforms strong baselines, generalizes across classifiers and datasets, and combines the strengths of language modeling and reinforcement learning to generate diverse and influential adversarial examples.
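One way to picture a reward that combines classifier confusion with meaning preservation is sketched below. The abstract does not give the formula, so the weighted sum, the threshold gate, and the parameter names `mi_threshold` and `weight` are assumptions for illustration only; `mi_score` is assumed to be a Mutual Implication score in [0, 1] computed elsewhere.

```python
from typing import Sequence


def confusion(probs: Sequence[float], true_label: int) -> float:
    """Probability mass the classifier assigns to labels other than the true one."""
    return 1.0 - probs[true_label]


def paraphrase_reward(
    probs: Sequence[float],     # classifier probabilities for the paraphrase
    true_label: int,            # gold label of the original text
    mi_score: float,            # assumed meaning-preservation score in [0, 1]
    mi_threshold: float = 0.5,  # hypothetical gate: below this, meaning is treated as lost
    weight: float = 0.5,        # hypothetical trade-off between confusion and meaning
) -> float:
    """Reward paraphrases that confuse the classifier while keeping the original meaning."""
    if not 0.0 <= mi_score <= 1.0:
        raise ValueError("mi_score must lie in [0, 1]")
    if mi_score < mi_threshold:
        return 0.0  # no reward for a paraphrase that changes the meaning
    return weight * confusion(probs, true_label) + (1.0 - weight) * mi_score
```

The gate keeps the policy from earning reward by drifting away from the original meaning, which would confuse the classifier trivially.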
Large language models (LLMs) are trained on vast amounts of data, including sensitive information that poses a risk to personal privacy if exposed. LLMs have shown the ability to memorize and reproduce portions of their training data when prompted by adversaries. Prior research has focused on addressing this memorization issue and preventing verbatim replication through techniques like knowledge unlearning and data pre-processing. However, these methods are limited in the number of samples they can protect and in the types of privacy they cover, and they can yield lower-quality generative models. To tackle this challenge more effectively, we propose “DeMem,” a novel unlearning approach that utilizes an efficient reinforcement learning feedback loop via proximal policy optimization. By fine-tuning the language model with a negative similarity score as a reward signal, we incentivize the LLM to learn a paraphrasing policy that unlearns the pre-training data. Our experiments demonstrate that DeMem surpasses strong baselines and state-of-the-art methods in its ability to generalize and to balance privacy against LLM performance.
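The negative-similarity reward can be sketched as follows. The abstract does not name the similarity measure, so this example substitutes a token-level ratio from Python's standard `difflib.SequenceMatcher`; the function names are hypothetical, and the PPO training loop itself is omitted.

```python
from difflib import SequenceMatcher


def similarity(generated: str, reference: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for whatever metric the method uses."""
    return SequenceMatcher(None, generated.split(), reference.split()).ratio()


def unlearning_reward(generated: str, reference: str) -> float:
    """Negative similarity: verbatim reproduction of the training text scores lowest."""
    return -similarity(generated, reference)
```

Under this reward, a continuation that copies the memorized text receives -1.0, while a paraphrase that shares fewer tokens with it receives a higher (less negative) value, which is the signal that pushes the policy toward paraphrasing.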