Guy Van Den Broeck
Also published as: Guy Van den Broeck
2026
Enabling Autoregressive Models to Fill In Masked Tokens
Daniel Mingyi Israel | Aditya Grover | Guy Van den Broeck
Findings of the Association for Computational Linguistics: EACL 2026
Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently incapable of masked infilling, which is the ability to predict masked tokens between past and future context. In contrast, MLM models suffer from intrinsic computational inefficiencies during both training and inference that hinder their scalability. This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that leverages the strengths of both paradigms to achieve state-of-the-art masked infilling performance. MARIA combines a pre-trained MLM and AR model by training a linear decoder that takes their concatenated hidden states as input. This minimal modification enables the AR model to perform infilling while retaining its inherent advantages in terms of faster inference with KV caching. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
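The fusion mechanism described above lends itself to a compact illustration. Below is a minimal PyTorch sketch of the idea as the abstract states it: a single linear decoder over the concatenated hidden states of a frozen AR model and a frozen MLM. The hidden sizes, vocabulary size, and class name are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class MariaDecoder(nn.Module):
    """Sketch of the MARIA fusion step: a linear decoder over the
    concatenated hidden states of a frozen AR model and a frozen MLM.
    Dimensions below are assumptions, not the paper's spec."""

    def __init__(self, ar_hidden: int, mlm_hidden: int, vocab_size: int):
        super().__init__()
        # Single linear layer mapping [h_ar; h_mlm] -> vocabulary logits.
        self.decoder = nn.Linear(ar_hidden + mlm_hidden, vocab_size)

    def forward(self, h_ar: torch.Tensor, h_mlm: torch.Tensor) -> torch.Tensor:
        # h_ar:  (batch, seq, ar_hidden)   from the autoregressive model
        # h_mlm: (batch, seq, mlm_hidden)  from the masked language model
        fused = torch.cat([h_ar, h_mlm], dim=-1)
        return self.decoder(fused)  # (batch, seq, vocab_size)

# Toy usage with random tensors standing in for real model hidden states.
dec = MariaDecoder(ar_hidden=768, mlm_hidden=768, vocab_size=32000)
logits = dec(torch.randn(1, 16, 768), torch.randn(1, 16, 768))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

Because only the decoder is trained, the AR model keeps its KV-caching inference path intact, which is the efficiency advantage the abstract highlights.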
2025
Adversarial Tokenization
Renato Geh | Zilei Shao | Guy Van Den Broeck
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the Llama3 standard tokenization of 'penguin' is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
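To make the "exponentially many tokenizations" claim concrete, here is a small self-contained Python sketch that enumerates every segmentation of a string over a toy vocabulary. The vocabulary and function name are assumptions for illustration only; this is not the paper's attack procedure.

```python
from typing import Iterator

def all_tokenizations(s: str, vocab: set[str]) -> Iterator[list[str]]:
    """Yield every way to segment s into tokens drawn from vocab.
    Illustrates the combinatorial tokenization space; the toy
    vocabulary below is an assumption for demonstration."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in vocab:
            for rest in all_tokenizations(s[i:], vocab):
                yield [prefix] + rest

vocab = {"p", "peng", "uin", "enguin", "penguin"}
for toks in all_tokenizations("penguin", vocab):
    print(toks)
# ['p', 'enguin'], ['peng', 'uin'], ['penguin']
```

With a realistic subword vocabulary the number of valid segmentations grows rapidly with string length, which is what gives an adversary room to pick a non-standard tokenization of the same malicious text.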
2024
Where is the signal in tokenization space?
Renato Geh | Honghua Zhang | Kareem Ahmed | Benjie Wang | Guy Van Den Broeck
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are typically shipped with tokenizers that *deterministically* encode text into so-called *canonical* token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes 'Tokens' as [Tok,ens], but [Tok,en,s] also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space. Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.
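The aggregation step the abstract mentions reduces to summing the probabilities of different token sequences that spell the same text. A minimal sketch in log space, assuming per-tokenization log-probabilities have already been obtained from some LLM scorer; the names and numbers below are made up for illustration.

```python
import math

def aggregate_logprob(logprobs: list[float]) -> float:
    """Log of the summed probability over a set of tokenizations of
    the same string: log(sum_i exp(logprobs[i])), computed stably
    via the log-sum-exp trick. Which tokenizations to include is an
    assumption of this sketch, not the paper's exact recipe."""
    m = max(logprobs)
    return m + math.log(sum(math.exp(lp - m) for lp in logprobs))

# Hypothetical log-probabilities for two tokenizations of 'Tokens':
# the canonical [Tok,ens] and the non-canonical [Tok,en,s].
canonical = -9.2
noncanonical = -13.5
print(aggregate_logprob([canonical, noncanonical]))  # ~ -9.19, above -9.2
```

The aggregated value can only exceed the canonical log-probability, and the abstract's empirical finding is that this extra mass carries usable signal on downstream benchmarks.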