Felix Friedrich
2026
SLR: Automated Synthesis for Scalable Logical Reasoning
Lukas Helff | Ahmad Omar | Felix Friedrich | Antonia Wüst | Hikaru Shindo | Rupert Mitchell | Tim Woydt | Patrick Schramowski | Wolfgang Stammer | Kristian Kersting
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of the computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
2025
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura | Mayank Mishra | Simone Tedeschi | Yekun Chai | Jason T. Stillerman | Felix Friedrich | Prateek Yadav | Tanmay Laud | Vu Minh Chien | Terry Yue Zhuo | Diganta Misra | Ben Bogin | Xuan-Son Vu | Marzena Karpinska | Arnav Varma Dantuluri | Wojciech Kusa | Tommaso Furlanello | Rio Yokota | Niklas Muennighoff | Suhas Pai | Tosin Adewumi | Veronika Laippala | Xiaozhe Yao | Adalberto Barbosa Junior | Aleksandr Drozd | Jordan Clive | Kshitij Gupta | Liangyu Chen | Qi Sun | Ken Tsui | Nour Moustafa-Fahmy | Nicolo Monti | Tai Dang | Ziyang Luo | Tien-Tung Bui | Roberto Navigli | Virendra Mehta | Matthew Blumberg | Victor May | Hiep Nguyen | Sampo Pyysalo
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B-parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes
Felix Friedrich | Katharina Hämmerl | Patrick Schramowski | Manuel Brack | Jindřich Libovický | Kristian Kersting | Alexander Fraser
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Text-to-image (T2I) generation models have achieved great results in image quality, flexibility, and text alignment, leading to widespread use. Through improvements in multilingual abilities, a larger community can access this technology. Yet, we show that multilingual models suffer from substantial gender bias. Furthermore, the expectation that results should be similar across languages does not hold. We introduce MAGBIG, a controlled benchmark designed to study gender bias in multilingual T2I models, and use it to assess the impact of multilingualism on gender bias. To this end, we construct a set of multilingual prompts that offers a carefully controlled setting accounting for the complex grammatical differences influencing gender across languages. Our results show strong gender biases and notable language-specific differences across models. While we explore prompt engineering strategies to mitigate these biases, we find them largely ineffective and sometimes even detrimental to text-to-image alignment. Our analysis highlights the need for research on diverse language representations and greater control over bias in T2I models.
Co-authors
- Kristian Kersting 2
- Patrick Schramowski 2
- Tosin Adewumi 1
- Matthew Blumberg 1
- Ben Bogin 1
- Manuel Brack 1
- Tien-Tung Bui 1
- Yekun Chai 1
- Liang-Yu Chen 1
- Vu Minh Chien 1
- Jordan Clive 1
- Tai Dang 1
- Arnav Varma Dantuluri 1
- Aleksandr Drozd 1
- Alexander Fraser 1
- Tommaso Furlanello 1
- Kshitij Gupta 1
- Lukas Helff 1
- Katharina Hämmerl 1
- Adalberto Barbosa Junior 1
- Marzena Karpinska 1
- Wojciech Kusa 1
- Veronika Laippala 1
- Tanmay Laud 1
- Jindřich Libovický 1
- Ziyang Luo 1
- Victor May 1
- Virendra Mehta 1
- Mayank Mishra 1
- Diganta Misra 1
- Rupert Mitchell 1
- Nicolo Monti 1
- Nour Moustafa-Fahmy 1
- Niklas Muennighoff 1
- Taishi Nakamura 1
- Roberto Navigli 1
- Hiep Nguyen 1
- Ahmad Omar 1
- Suhas Pai 1
- Sampo Pyysalo 1
- Hikaru Shindo 1
- Wolfgang Stammer 1
- Jason T. Stillerman 1
- Qi Sun 1
- Simone Tedeschi 1
- Ken Tsui 1
- Xuan-Son Vu 1
- Antonia Wüst 1
- Tim Woydt 1
- Prateek Yadav 1
- Xiaozhe Yao 1
- Rio Yokota 1
- Terry Yue Zhuo 1