Itay Yona

2026

In-Context Representation Hijacking
Itay Yona | Amir Sarid | Michael Karasik | Yossi Gandelsman
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce **Doublespeak**, a simple in-context representation hijacking attack against language models. The attack works by systematically replacing a harmful keyword (e.g., *bomb*) with a benign token (e.g., *carrot*) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., *"How to build a carrot?"*) are internally interpreted as disallowed instructions (*"How to build a bomb?"*), thereby bypassing the model’s safety alignment. We use interpretability tools to show that this semantic shift occurs progressively across layers. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on both closed-source and open-source systems, reaching 74% on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of language models, indicating that current alignment strategies are insufficient and should instead operate at the representation level.

2025

Measuring memorization in language models via probabilistic extraction
Jamie Hayes | Marika Swanberg | Harsh Chaudhari | Itay Yona | Ilia Shumailov | Milad Nasr | Christopher A. Choquette-Choo | Katherine Lee | A. Feder Cooper
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) are susceptible to memorizing training data, raising concerns about the potential extraction of sensitive information at generation time. Discoverable extraction is the most common method for measuring this issue: split a training example into a prefix and suffix, then prompt the LLM with the prefix, and deem the example extractable if the LLM generates the matching suffix using greedy sampling. This definition yields a yes-or-no determination of whether extraction was successful with respect to a single query. Though this definition is efficient to compute, we show that it is unreliable because it does not account for the non-determinism present in more realistic (non-greedy) sampling schemes, for which LLMs produce a range of outputs for the same prompt. We introduce probabilistic discoverable extraction, which, without additional cost, relaxes discoverable extraction by considering multiple queries to quantify the probability of extracting a target sequence. We evaluate our probabilistic measure across different models, sampling schemes, and training-data repetitions, and find that this measure provides more nuanced information about extraction risk than traditional discoverable extraction.
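
The multiple-query idea in this abstract can be illustrated with a minimal sketch. It is not the paper's exact formulation: it assumes that each query is an independent sample and that the per-query probability of emitting the target suffix can be computed from the model's token log-probabilities under the chosen sampling scheme. All function names below are hypothetical.

```python
import math


def sequence_probability(token_logprobs):
    """Per-query probability of sampling the whole target suffix.

    Assumes token_logprobs holds the log-probability of each suffix token
    given the prefix and earlier suffix tokens, under the sampling scheme
    being studied (an assumption of this sketch).
    """
    return math.exp(sum(token_logprobs))


def prob_extracted_within(n_queries, p_single):
    """Probability of at least one exact match in n independent queries."""
    return 1.0 - (1.0 - p_single) ** n_queries


def queries_needed(p_single, target_prob):
    """Smallest n such that prob_extracted_within(n, p_single) >= target_prob.

    target_prob is expected to lie strictly between 0 and 1.
    """
    if p_single <= 0.0:
        return math.inf  # the suffix can never be sampled
    if p_single >= 1.0:
        return 1  # the suffix is always produced on the first query
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - p_single))
```

Under these assumptions, a suffix with a per-query probability of 0.1 would count as not extracted in many single-query checks, yet `queries_needed(0.1, 0.5)` returns 7, meaning seven sampled queries already give better-than-even odds of reproducing it. Because the per-query probability comes from log-probabilities the model already computes for one pass over the sequence, this kind of estimate needs no repeated generation.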

Venues

ACL (1)
NAACL (1)
