Xiaozhe Yao
2026
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Alejandro Hernández-Cano | Alexander Hägele | Allen Hao Huang | Angelika Romanou | Antoni-Joan Solergibert | Barna Pásztor | Bettina Messmer | Dhia Garbaya | Eduard Frank Ďurech | Ido Hakimi | Juan Garcia Giraldo | Mete Ismayilzada | Negar Foroutan | Skander Moalla | Tiancheng Chen | Vinko Sabolčec | Yixuan Xu | Michael Aerni | Badr AlKhamissi | Inés Altemir Marinas | Mohammad Hossein Amani | Matin Ansaripour | Ilia Badanin | Harold Benoit | Emanuela Boros | Nicholas John Browning | Fabian Bösch | Maximilian Böther | Niklas Canova | Camille Challier | Clément Charmillot | Jonathan Coles | Jan Milan Deriu | Arnout Devos | Lukas Drescher | Daniil Dzenhaliou | Maud Ehrmann | Dongyang Fan | Simin Fan | Silin Gao | Miguel Gila | María Grandury | Diba Hashemi | Alexander Miserlis Hoyle | Jiaming Jiang | Mark Klein | Andrei Kucharavy | Anastasiia Kucherenko | Frederike Lübeck | Roman Machacek | Theofilos Ioannis Manitaras | Andreas Marfurt | Kyle Matoba | Simon Matrenok | Henrique Mendonça | Fawzi Roberto Mohamed | Syrielle Montariol | Luca Mouchel | Sven Najem-Meyer | Jingwei Ni | Gennaro Oliva | Matteo Pagliardini | Elia Palme | Andrei Panferov | Léo Paoletti | Marco Passerini | Ivan Pavlov | Auguste Poiroux | Kaustubh Ponkshe | Nathan Ranchin | Javier Rando | Mathieu Sauser | Jakhongir Saydaliev | Mukhammadali Sayfiddinov | Marian Schneider | Stefano Schuppli | Marco Scialanga | Andrei Semenov | Kumar Shridhar | Raghav Singhal | Anna Sotnikova | Alexander Sternfeld | Ayush Kumar Tarun | Paul Teiletche | Jannis Vamvas | Xiaozhe Yao | Hao Zhao | Alexander Ilic | Ana Klimovic | Andreas Krause | Caglar Gulcehre | David Rosenthal | Elliott Ash | Florian Tramèr | Joost VandeVondele | Livio Veraldi | Martin Rajman | Thomas C. Schulthess | Torsten Hoefler | Antoine Bosselut | Martin Jaggi | Imanol Schlag
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Alejandro Hernández-Cano | Alexander Hägele | Allen Hao Huang | Angelika Romanou | Antoni-Joan Solergibert | Barna Pásztor | Bettina Messmer | Dhia Garbaya | Eduard Frank Ďurech | Ido Hakimi | Juan Garcia Giraldo | Mete Ismayilzada | Negar Foroutan | Skander Moalla | Tiancheng Chen | Vinko Sabolčec | Yixuan Xu | Michael Aerni | Badr AlKhamissi | Inés Altemir Marinas | Mohammad Hossein Amani | Matin Ansaripour | Ilia Badanin | Harold Benoit | Emanuela Boros | Nicholas John Browning | Fabian Bösch | Maximilian Böther | Niklas Canova | Camille Challier | Clément Charmillot | Jonathan Coles | Jan Milan Deriu | Arnout Devos | Lukas Drescher | Daniil Dzenhaliou | Maud Ehrmann | Dongyang Fan | Simin Fan | Silin Gao | Miguel Gila | María Grandury | Diba Hashemi | Alexander Miserlis Hoyle | Jiaming Jiang | Mark Klein | Andrei Kucharavy | Anastasiia Kucherenko | Frederike Lübeck | Roman Machacek | Theofilos Ioannis Manitaras | Andreas Marfurt | Kyle Matoba | Simon Matrenok | Henrique Mendonça | Fawzi Roberto Mohamed | Syrielle Montariol | Luca Mouchel | Sven Najem-Meyer | Jingwei Ni | Gennaro Oliva | Matteo Pagliardini | Elia Palme | Andrei Panferov | Léo Paoletti | Marco Passerini | Ivan Pavlov | Auguste Poiroux | Kaustubh Ponkshe | Nathan Ranchin | Javier Rando | Mathieu Sauser | Jakhongir Saydaliev | Mukhammadali Sayfiddinov | Marian Schneider | Stefano Schuppli | Marco Scialanga | Andrei Semenov | Kumar Shridhar | Raghav Singhal | Anna Sotnikova | Alexander Sternfeld | Ayush Kumar Tarun | Paul Teiletche | Jannis Vamvas | Xiaozhe Yao | Hao Zhao | Alexander Ilic | Ana Klimovic | Andreas Krause | Caglar Gulcehre | David Rosenthal | Elliott Ash | Florian Tramèr | Joost VandeVondele | Livio Veraldi | Martin Rajman | Thomas C. Schulthess | Torsten Hoefler | Antoine Bosselut | Martin Jaggi | Imanol Schlag
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Open LLMs enable AI practitioners to control development costs by building on an existing foundation for downstream applications. While offering substantial promise, current models often fail to meet the needs of users needing open solutions aligned with responsible AI principles, including data compliance, transparency, and inclusivity. In this work, we present Apertus, a fully open suite of large language models (LLMs) designed to address responsibility shortcomings in today’s open model ecosystem, namely data responsibility and global representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of data memorization, we also adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. Apertus also drastically expands multilingual coverage, training on 15T tokens from over approximately 1800 languages, with about 40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivaling or surpassing open-weight counterparts.
2025
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura | Mayank Mishra | Simone Tedeschi | Yekun Chai | Jason T. Stillerman | Felix Friedrich | Prateek Yadav | Tanmay Laud | Vu Minh Chien | Terry Yue Zhuo | Diganta Misra | Ben Bogin | Xuan-Son Vu | Marzena Karpinska | Arnav Varma Dantuluri | Wojciech Kusa | Tommaso Furlanello | Rio Yokota | Niklas Muennighoff | Suhas Pai | Tosin Adewumi | Veronika Laippala | Xiaozhe Yao | Adalberto Barbosa Junior | Aleksandr Drozd | Jordan Clive | Kshitij Gupta | Liangyu Chen | Qi Sun | Ken Tsui | Nour Moustafa-Fahmy | Nicolo Monti | Tai Dang | Ziyang Luo | Tien-Tung Bui | Roberto Navigli | Virendra Mehta | Matthew Blumberg | Victor May | Hiep Nguyen | Sampo Pyysalo
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Taishi Nakamura | Mayank Mishra | Simone Tedeschi | Yekun Chai | Jason T. Stillerman | Felix Friedrich | Prateek Yadav | Tanmay Laud | Vu Minh Chien | Terry Yue Zhuo | Diganta Misra | Ben Bogin | Xuan-Son Vu | Marzena Karpinska | Arnav Varma Dantuluri | Wojciech Kusa | Tommaso Furlanello | Rio Yokota | Niklas Muennighoff | Suhas Pai | Tosin Adewumi | Veronika Laippala | Xiaozhe Yao | Adalberto Barbosa Junior | Aleksandr Drozd | Jordan Clive | Kshitij Gupta | Liangyu Chen | Qi Sun | Ken Tsui | Nour Moustafa-Fahmy | Nicolo Monti | Tai Dang | Ziyang Luo | Tien-Tung Bui | Roberto Navigli | Virendra Mehta | Matthew Blumberg | Victor May | Hiep Nguyen | Sampo Pyysalo
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Pretrained language models are integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.
Search
Fix author
Co-authors
- Tosin Adewumi 1
- Michael Aerni 1
- Badr AlKhamissi 1
- Mohammad Hossein Amani 1
- Matin Ansaripour 1
- Elliott Ash 1
- Ilia Badanin 1
- Harold Benoit 1
- Matthew Blumberg 1
- Ben Bogin 1
- Emanuela Boroş 1
- Antoine Bosselut 1
- Nicholas John Browning 1
- Tien-Tung Bui 1
- Fabian Bösch 1
- Maximilian Böther 1
- Niklas Canova 1
- Yekun Chai 1
- Camille Challier 1
- Clément Charmillot 1
- Liang-Yu Chen 1
- Tiancheng Chen 1
- Vu Minh Chien 1
- Jordan Clive 1
- Jonathan Coles 1
- Tai Dang 1
- Arnav Varma Dantuluri 1
- Jan Milan Deriu 1
- Arnout Devos 1
- Lukas Drescher 1
- Aleksandr Drozd 1
- Daniil Dzenhaliou 1
- Maud Ehrmann 1
- Dongyang Fan 1
- Simin Fan 1
- Negar Foroutan 1
- Felix Friedrich 1
- Tommaso Furlanello 1
- Silin Gao 1
- Dhia Garbaya 1
- Miguel Gila 1
- Juan Garcia Giraldo 1
- María Grandury 1
- Kshitij Gupta 1
- Çağlar Gu̇lçehre 1
- Ido Hakimi 1
- Diba Hashemi 1
- Alejandro Hernández-Cano 1
- Torsten Hoefler 1
- Alexander Miserlis Hoyle 1
- Allen Hao Huang 1
- Alexander Hägele 1
- Alexander Ilic 1
- Mete Ismayilzada 1
- Martin Jaggi 1
- Jiaming Jiang 1
- Adalberto Barbosa Junior 1
- Marzena Karpinska 1
- Mark Klein 1
- Ana Klimovic 1
- Andreas Krause 1
- Andrei Kucharavy 1
- Anastasiia Kucherenko 1
- Wojciech Kusa 1
- Veronika Laippala 1
- Tanmay Laud 1
- Ziyang Luo 1
- Frederike Lübeck 1
- Roman Machacek 1
- Theofilos Ioannis Manitaras 1
- Andreas Marfurt 1
- Inés Altemir Marinas 1
- Kyle Matoba 1
- Simon Matrenok 1
- Victor May 1
- Virendra Mehta 1
- Henrique Mendonça 1
- Bettina Messmer 1
- Mayank Mishra 1
- Diganta Misra 1
- Skander Moalla 1
- Fawzi Roberto Mohamed 1
- Syrielle Montariol 1
- Nicolo Monti 1
- Luca Mouchel 1
- Nour Moustafa-Fahmy 1
- Niklas Muennighoff 1
- Sven Najem-Meyer 1
- Taishi Nakamura 1
- Roberto Navigli 1
- Hiep Nguyen 1
- Jingwei Ni 1
- Gennaro Oliva 1
- Matteo Pagliardini 1
- Suhas Pai 1
- Elia Palme 1
- Andrei Panferov 1
- Léo Paoletti 1
- Marco Passerini 1
- Ivan Pavlov 1
- Auguste Poiroux 1
- Kaustubh Ponkshe 1
- Sampo Pyysalo 1
- Barna Pásztor 1
- Martin Rajman 1
- Nathan Ranchin 1
- Javier Rando 1
- Angelika Romanou 1
- David Rosenthal 1
- Vinko Sabolčec 1
- Mathieu Sauser 1
- Jakhongir Saydaliev 1
- Mukhammadali Sayfiddinov 1
- Imanol Schlag 1
- Marian Schneider 1
- Thomas C. Schulthess 1
- Stefano Schuppli 1
- Marco Scialanga 1
- Andrei Semenov 1
- Kumar Shridhar 1
- Raghav Singhal 1
- Antoni-Joan Solergibert 1
- Anna Sotnikova 1
- Alexander Sternfeld 1
- Jason T. Stillerman 1
- Qi Sun 1
- Ayush Kumar Tarun 1
- Simone Tedeschi 1
- Paul Teiletche 1
- Florian Tramèr 1
- Ken Tsui 1
- Jannis Vamvas 1
- Joost VandeVondele 1
- Livio Veraldi 1
- Xuan-Son Vu 1
- Yixuan Xu 1
- Prateek Yadav 1
- Rio Yokota 1
- Hao Zhao 1
- Terry Yue Zhuo 1
- Eduard Frank Ďurech 1