2025
pdf
bib
abs
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Mehdi Ali
|
Manuel Brack
|
Max Lübbering
|
Elias Wendt
|
Abbas Goher Khan
|
Richard Rutmann
|
Alex Jude
|
Maurice Kraus
|
Alexander Arno Weber
|
Felix Stollenwerk
|
David Kaczér
|
Florian Mai
|
Lucie Flek
|
Rafet Sifa
|
Nicolas Flores-Herr
|
Joachim Koehler
|
Patrick Schramowski
|
Michael Fromm
|
Kristian Kersting
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
2024
pdf
bib
abs
Do Multilingual Large Language Models Mitigate Stereotype Bias?
Shangrui Nie
|
Michael Fromm
|
Charles Welch
|
Rebekka Görge
|
Akbar Karimi
|
Joan Plepi
|
Nazia Mowmita
|
Nicolas Flores-Herr
|
Mehdi Ali
|
Lucie Flek
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP
While preliminary findings indicate that multilingual LLMs exhibit reduced bias compared to monolingual ones, a comprehensive understanding of the effect of multilingual training on bias mitigation, is lacking. This study addresses this gap by systematically training six LLMs of identical size (2.6B parameters) and architecture: five monolingual models (English, German, French, Italian, and Spanish) and one multilingual model trained on an equal distribution of data across these languages, all using publicly available data. To ensure robust evaluation, standard bias benchmarks were automatically translated into the five target languages and verified for both translation quality and bias preservation by human annotators. Our results consistently demonstrate that multilingual training effectively mitigates bias. Moreover, we observe that multilingual models achieve not only lower bias but also superior prediction accuracy when compared to monolingual models with the same amount of training data, model architecture, and size.
pdf
bib
abs
Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?
Alexander Arno Weber
|
Klaudia Thellmann
|
Jan Ebert
|
Nicolas Flores-Herr
|
Jens Lehmann
|
Michael Fromm
|
Mehdi Ali
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The adaption of multilingual pre-trained LLMs into eloquent and helpful assistants is essential to facilitate their use across different language regions. In that spirit, we are the first to conduct an extensive study of the performance of multilingual models instruction-tuned on different language compositions on parallel instruction-tuning benchmarks across a selection of the most spoken Indo-European languages. We systematically examine the effects of language and instruction dataset size on a mid-sized and a large, multilingual LLMs by instruction-tuning them on parallel instruction-tuning datasets. Our results demonstrate that instruction-tuning on parallel instead of monolingual corpora benefits cross-lingual instruction following capabilities by up to 9.9%. Furthermore, we show that the Superficial Alignment Hypothesis does not hold in general, as the investigated multilingual 7B parameter model presents a counter-example requiring large-scale instruction-tuning datasets. Finally, we conduct a human annotation study to understand the alignment between human-based and GPT-4-based evaluation within multilingual chat scenarios.
pdf
bib
abs
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
|
Michael Fromm
|
Klaudia Thellmann
|
Richard Rutmann
|
Max Lübbering
|
Johannes Leveling
|
Katrin Klug
|
Jan Ebert
|
Niclas Doll
|
Jasper Buschhoff
|
Charvi Jain
|
Alexander Weber
|
Lena Jurkschat
|
Hammam Abdelwahab
|
Chelsea John
|
Pedro Ortiz Suarez
|
Malte Ostendorff
|
Samuel Weinbach
|
Rafet Sifa
|
Stefan Kesselheim
|
Nicolas Flores-Herr
Findings of the Association for Computational Linguistics: NAACL 2024
The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot.Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model’s downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multi-lingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
2023
pdf
bib
abs
Cross-Domain Argument Quality Estimation
Michael Fromm
|
Max Berrendorf
|
Evgeniy Faerman
|
Thomas Seidl
Findings of the Association for Computational Linguistics: ACL 2023
Argumentation is one of society’s foundational pillars, and, sparked by advances in NLP, and the vast availability of text data, automated mining of arguments receives increasing attention. A decisive property of arguments is their strength or quality. While there are works on the automated estimation of argument strength, their scope is narrow:They focus on isolated datasets and neglect the interactions with related argument-mining tasks, such as argument identification and evidence detection. In this work, we close this gap by approaching argument quality estimation from multiple different angles:Grounded on rich results from thorough empirical evaluations, we assess the generalization capabilities of argument quality estimation across diverse domains and the interplay with related argument mining tasks. We find that generalization depends on a sufficient representation of different domains in the training part. In zero-shot transfer and multi-task experiments, we reveal that argument quality is among the more challenging tasks but can improve others. We publish our code at
https://github.com/fromm-m/acl-cross-domain-aq.
2021
pdf
bib
abs
Active Learning for Argument Strength Estimation
Nataliia Kees
|
Michael Fromm
|
Evgeniy Faerman
|
Thomas Seidl
Proceedings of the Second Workshop on Insights from Negative Results in NLP
High-quality arguments are an essential part of decision-making. Automatically predicting the quality of an argument is a complex task that recently got much attention in argument mining. However, the annotation effort for this task is exceptionally high. Therefore, we test uncertainty-based active learning (AL) methods on two popular argument-strength data sets to estimate whether sample-efficient learning can be enabled. Our extensive empirical evaluation shows that uncertainty-based acquisition functions can not surpass the accuracy reached with the random acquisition on these data sets.