Imanol Schlag
2026
Learning Vision-Language Alignment in Unified LLMs with 24 Text Tokens per Image
Nicola Irmiger | Yixuan Xu | Raphael Kreft | Aram Davtyan | Manuel Kaufmann | Imanol Schlag
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
We explore how to adapt a pre-trained large language model to understand and generate both visual and textual information. We use an image tokenizer to compress images into discrete tokens, and train the model using the next-token prediction paradigm with the standard cross-entropy loss. A two-stage pre-training approach is applied, first training on image-only data and then on a small amount of image-text data. We evaluate how different image-text token mixing ratios during continual pre-training affect the model’s ability to retain language skills while learning visual representations. The resulting model shows promising signs of flexible multimodal understanding, bridging vision and language in a single pre-trained model.
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Alejandro Hernández-Cano | Alexander Hägele | Allen Hao Huang | Angelika Romanou | Antoni-Joan Solergibert | Barna Pásztor | Bettina Messmer | Dhia Garbaya | Eduard Frank Ďurech | Ido Hakimi | Juan Garcia Giraldo | Mete Ismayilzada | Negar Foroutan | Skander Moalla | Tiancheng Chen | Vinko Sabolčec | Yixuan Xu | Michael Aerni | Badr AlKhamissi | Inés Altemir Marinas | Mohammad Hossein Amani | Matin Ansaripour | Ilia Badanin | Harold Benoit | Emanuela Boros | Nicholas John Browning | Fabian Bösch | Maximilian Böther | Niklas Canova | Camille Challier | Clément Charmillot | Jonathan Coles | Jan Milan Deriu | Arnout Devos | Lukas Drescher | Daniil Dzenhaliou | Maud Ehrmann | Dongyang Fan | Simin Fan | Silin Gao | Miguel Gila | María Grandury | Diba Hashemi | Alexander Miserlis Hoyle | Jiaming Jiang | Mark Klein | Andrei Kucharavy | Anastasiia Kucherenko | Frederike Lübeck | Roman Machacek | Theofilos Ioannis Manitaras | Andreas Marfurt | Kyle Matoba | Simon Matrenok | Henrique Mendonça | Fawzi Roberto Mohamed | Syrielle Montariol | Luca Mouchel | Sven Najem-Meyer | Jingwei Ni | Gennaro Oliva | Matteo Pagliardini | Elia Palme | Andrei Panferov | Léo Paoletti | Marco Passerini | Ivan Pavlov | Auguste Poiroux | Kaustubh Ponkshe | Nathan Ranchin | Javier Rando | Mathieu Sauser | Jakhongir Saydaliev | Mukhammadali Sayfiddinov | Marian Schneider | Stefano Schuppli | Marco Scialanga | Andrei Semenov | Kumar Shridhar | Raghav Singhal | Anna Sotnikova | Alexander Sternfeld | Ayush Kumar Tarun | Paul Teiletche | Jannis Vamvas | Xiaozhe Yao | Hao Zhao | Alexander Ilic | Ana Klimovic | Andreas Krause | Caglar Gulcehre | David Rosenthal | Elliott Ash | Florian Tramèr | Joost VandeVondele | Livio Veraldi | Martin Rajman | Thomas C. Schulthess | Torsten Hoefler | Antoine Bosselut | Martin Jaggi | Imanol Schlag
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Open LLMs enable AI practitioners to control development costs by building on an existing foundation for downstream applications. While offering substantial promise, current models often fail to meet the needs of users who require open solutions aligned with responsible AI principles, including data compliance, transparency, and inclusivity. In this work, we present Apertus, a fully open suite of large language models (LLMs) designed to address responsibility shortcomings in today’s open model ecosystem, namely data responsibility and global representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of data memorization, we also adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. Apertus also drastically expands multilingual coverage, training on 15T tokens from over 1800 languages, with about 40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivaling or surpassing open-weight counterparts.
2024
Swiss AI Initiative - Collecting Large Amounts of High-Quality Data for Training Large Language Models
Jan Deriu | Maud Ehrmann | Emanuela Boros | Maximilian Böther | Christiane Sibille | Ihor Protsenko | Marta Brucka | Imanol Schlag | Elliott Ash
Proceedings of the 9th edition of the Swiss Text Analytics Conference
On the Effect of (Near) Duplicate Subwords in Language Modelling
Anton Schäfer | Thomas Hofmann | Imanol Schlag | Tiago Pimentel
Findings of the Association for Computational Linguistics: ACL 2024
Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. However, this process, while typically lossless, may lead to less efficient LM training, because it removes character-level information, thereby making it more difficult to generalise across similar subwords, such as *now* and *Now*. We refer to such subwords as **near duplicates**. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound on how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each token in our LM’s vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that deduplicating them considerably hurts LM performance, but that this loss in performance can be easily mitigated.
Co-authors
- Elliott Ash 2
- Emanuela Boroş 2
- Maximilian Böther 2
- Jan Milan Deriu 2
- Maud Ehrmann 2
- Yixuan Xu 2
- Michael Aerni 1
- Badr AlKhamissi 1
- Mohammad Hossein Amani 1
- Matin Ansaripour 1
- Ilia Badanin 1
- Harold Benoit 1
- Antoine Bosselut 1
- Nicholas John Browning 1
- Marta Brucka 1
- Fabian Bösch 1
- Niklas Canova 1
- Camille Challier 1
- Clément Charmillot 1
- Tiancheng Chen 1
- Jonathan Coles 1
- Aram Davtyan 1
- Arnout Devos 1
- Lukas Drescher 1
- Daniil Dzenhaliou 1
- Dongyang Fan 1
- Simin Fan 1
- Negar Foroutan 1
- Silin Gao 1
- Dhia Garbaya 1
- Miguel Gila 1
- Juan Garcia Giraldo 1
- María Grandury 1
- Çağlar Gülçehre 1
- Ido Hakimi 1
- Diba Hashemi 1
- Alejandro Hernández-Cano 1
- Torsten Hoefler 1
- Thomas Hofmann 1
- Alexander Miserlis Hoyle 1
- Allen Hao Huang 1
- Alexander Hägele 1
- Alexander Ilic 1
- Nicola Irmiger 1
- Mete Ismayilzada 1
- Martin Jaggi 1
- Jiaming Jiang 1
- Manuel Kaufmann 1
- Mark Klein 1
- Ana Klimovic 1
- Andreas Krause 1
- Raphael Kreft 1
- Andrei Kucharavy 1
- Anastasiia Kucherenko 1
- Frederike Lübeck 1
- Roman Machacek 1
- Theofilos Ioannis Manitaras 1
- Andreas Marfurt 1
- Inés Altemir Marinas 1
- Kyle Matoba 1
- Simon Matrenok 1
- Henrique Mendonça 1
- Bettina Messmer 1
- Skander Moalla 1
- Fawzi Roberto Mohamed 1
- Syrielle Montariol 1
- Luca Mouchel 1
- Sven Najem-Meyer 1
- Jingwei Ni 1
- Gennaro Oliva 1
- Matteo Pagliardini 1
- Elia Palme 1
- Andrei Panferov 1
- Léo Paoletti 1
- Marco Passerini 1
- Ivan Pavlov 1
- Tiago Pimentel 1
- Auguste Poiroux 1
- Kaustubh Ponkshe 1
- Ihor Protsenko 1
- Barna Pásztor 1
- Martin Rajman 1
- Nathan Ranchin 1
- Javier Rando 1
- Angelika Romanou 1
- David Rosenthal 1
- Vinko Sabolčec 1
- Mathieu Sauser 1
- Jakhongir Saydaliev 1
- Mukhammadali Sayfiddinov 1
- Marian Schneider 1
- Thomas C. Schulthess 1
- Stefano Schuppli 1
- Anton Schäfer 1
- Marco Scialanga 1
- Andrei Semenov 1
- Kumar Shridhar 1
- Christiane Sibille 1
- Raghav Singhal 1
- Antoni-Joan Solergibert 1
- Anna Sotnikova 1
- Alexander Sternfeld 1
- Ayush Kumar Tarun 1
- Paul Teiletche 1
- Florian Tramèr 1
- Jannis Vamvas 1
- Joost VandeVondele 1
- Livio Veraldi 1
- Xiaozhe Yao 1
- Hao Zhao 1
- Eduard Frank Ďurech 1