Imanol Schlag



2024

On the Effect of (Near) Duplicate Subwords in Language Modelling
Anton Schäfer | Thomas Hofmann | Imanol Schlag | Tiago Pimentel
Findings of the Association for Computational Linguistics: ACL 2024

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. However, this process, while typically lossless, may lead to less efficient LM training, because it removes character-level information, thereby making it more difficult to generalise across similar subwords, such as *now* and *Now*. We refer to such subwords as **near duplicates**. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound to how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each token in our LM’s vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that deduplicating them considerably hurts LM performance, but that this loss in performance can be easily mitigated.
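The fully duplicated setting described in this abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' code: the function names, the `p_duplicate` parameter, and the convention that the duplicate of token `t` has index `t + vocab_size` are all hypothetical choices made here for clarity.

```python
import random

def duplicate_tokens(token_ids, vocab_size, p_duplicate=0.5, seed=0):
    """Replace each token id t by either t or its duplicate t + vocab_size.

    Assumption: the duplicate of token t is stored at index t + vocab_size,
    and each occurrence is swapped independently with probability p_duplicate.
    """
    rng = random.Random(seed)
    return [
        t + vocab_size if rng.random() < p_duplicate else t
        for t in token_ids
    ]

def merge_duplicates(token_ids, vocab_size):
    """Collapse every duplicate back onto its original token id."""
    return [t % vocab_size for t in token_ids]

if __name__ == "__main__":
    ids = [3, 1, 4, 1, 5]
    duplicated = duplicate_tokens(ids, vocab_size=10)
    # Each position holds either the original id or the original id + 10.
    print(duplicated)
    # Merging recovers the original sequence, so no information is lost.
    print(merge_duplicates(duplicated, vocab_size=10))
```

Under these assumptions the duplicated vocabulary has twice as many entries, and a token and its duplicate are interchangeable by construction, which is what makes them a perfectly equivalent class.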

Swiss AI Initiative - Collecting Large Amounts of High-Quality Data for Training Large Language Models
Jan Deriu | Maud Ehrmann | Emanuela Boros | Maximilian Böther | Christiane Sibille | Ihor Protsenko | Marta Brucka | Imanol Schlag | Elliott Ash
Proceedings of the 9th edition of the Swiss Text Analytics Conference