Felix Stollenwerk


2025

Better Embeddings with Coupled Adam
Felix Stollenwerk | Tobias Stollenwerk
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.
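To make the coupling concrete, here is a minimal PyTorch sketch of one update step, assuming (one reading of the abstract) that the second-moment estimate of the embedding matrix is averaged over the vocabulary dimension so that all embedding vectors share the same adaptive scaling. The exact formulation is given in the paper; all names below are illustrative.

```python
import torch

def coupled_adam_step(param, grad, m, v, step,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update for an embedding matrix of shape (vocab_size, dim).

    Moments as in standard Adam, except the second moment is "coupled":
    every row uses the mean second moment over the vocabulary, so all
    embedding vectors get the same effective per-dimension step size.
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment (standard Adam)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment (standard Adam)
    v_coupled = v.mean(dim=0, keepdim=True)              # coupling across the vocabulary
    m_hat = m / (1 - beta1 ** step)                      # bias correction
    v_hat = v_coupled / (1 - beta2 ** step)
    param.data.addcdiv_(m_hat, v_hat.sqrt() + eps, value=-lr)

# Toy usage (shapes only, random data):
emb = torch.nn.Parameter(torch.randn(1000, 64))
m, v = torch.zeros_like(emb), torch.zeros_like(emb)
coupled_adam_step(emb, torch.randn_like(emb), m, v, step=1)
```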

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Mehdi Ali | Manuel Brack | Max Lübbering | Elias Wendt | Abbas Goher Khan | Richard Rutmann | Alex Jude | Maurice Kraus | Alexander Arno Weber | Felix Stollenwerk | David Kaczér | Florian Mai | Lucie Flek | Rafet Sifa | Nicolas Flores-Herr | Joachim Koehler | Patrick Schramowski | Michael Fromm | Kristian Kersting
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
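As a rough illustration of the distillation step, the sketch below fits a small regression head on frozen multilingual document embeddings to mimic LLM-assigned quality scores, then filters by a threshold. The dimensions, placeholder data, and threshold are assumptions for illustration only; the actual JQL annotators and training setup are described in the paper.

```python
import torch
import torch.nn as nn

# Stand-ins for real data: embeddings from a frozen multilingual encoder,
# and quality scores annotated by a strong LLM (both assumed precomputed).
EMB_DIM = 768
doc_embeddings = torch.randn(4096, EMB_DIM)
llm_quality_scores = torch.rand(4096)  # assumed to lie in [0, 1]

# Lightweight annotator: a small MLP head; the encoder stays frozen, so
# annotating a document costs one forward pass of this head.
head = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

for _ in range(100):  # distillation: regress the LLM's scores
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(doc_embeddings).squeeze(-1),
                                  llm_quality_scores)
    loss.backward()
    opt.step()

# Filtering: keep documents whose predicted quality clears the threshold.
with torch.no_grad():
    keep = head(doc_embeddings).squeeze(-1) > 0.5
print(f"retained {int(keep.sum())} of {len(keep)} documents")
```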

2024

GPT-SW3: An Autoregressive Language Model for the Scandinavian Languages
Ariel Ekgren | Amaru Cuba Gyllensten | Felix Stollenwerk | Joey Öhman | Tim Isbister | Evangelia Gogoulou | Fredrik Carlsson | Judit Casademont | Magnus Sahlgren
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper details the process of developing the first native large generative language model for the North Germanic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, through training configuration and instruction finetuning, to evaluation, applications, and considerations for release strategies. We discuss the pros and cons of developing large language models for smaller languages in relatively peripheral regions of the globe, and we hope that this paper can serve as a guide and reference for other researchers who undertake the development of large generative models for smaller languages.

2023

nerblackbox: A High-level Library for Named Entity Recognition in Python
Felix Stollenwerk
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We present **nerblackbox**, a Python library that facilitates the use of state-of-the-art transformer-based models for named entity recognition. It provides simple-to-use yet powerful methods to access data and models from a wide range of sources, for fully automated model training and evaluation, and for versatile model inference. While many technical challenges are solved and hidden from the user by default, **nerblackbox** also offers fine-grained control and a rich set of customizable features. It is thus aimed at application-oriented developers as well as machine learning experts and researchers.
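For context, the snippet below shows the kind of transformer-based NER inference that the library automates, written against the Hugging Face transformers pipeline rather than nerblackbox's own API (for which see the project documentation); the checkpoint name is just a publicly available example.

```python
from transformers import pipeline

# A public token-classification checkpoint from the Hugging Face Hub;
# the model name is an example, not something shipped with nerblackbox.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

for entity in ner("The United Nations is headquartered in New York."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```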