Jan Ebert
2026
Synthetic Instruction Generation for Low-Resource Nordic Languages: Viability and Limitations in LLM Instruction-Tuning
Mathias Stenlund | Annika Simonsen | Lars Bungum | Jan Ebert | Jiangtao Wang | Oleg Filatov | Hemanadhan Myneni | Morris Riedel | Hafsteinn Einarsson
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Pretrained large language models (LLMs) gain instruction-following abilities through instruction-tuning, a method that relies on datasets of instruction–response pairs. However, for low-resource languages, collecting human-authored instructions is costly, raising the question of whether synthetic instructions can substitute for human-authored instructions in non-English languages. We compare instruction-tuning of a smaller pretrained LLM in four Nordic languages using (a) human-authored instructions paired with synthetic responses and (b) fully synthetic instruction–response pairs generated with a minimal-effort pipeline. Native-speaker evaluations show that models instruction-tuned on synthetic instructions perform on par with those trained on human-authored instructions for the largest Nordic languages, suggesting that minimal-effort synthetic instructions can serve as a practical alternative. In contrast, response quality deteriorates sharply for Icelandic, underscoring the limitations of current synthetic data generation pipelines when the LLM's competence in the target language is weak. Overall, our results highlight that while synthetic instructions can enable cost-efficient instruction-tuning for the largest Nordic languages, they remain insufficient for Icelandic, clarifying when minimal-effort synthetic approaches suffice and when they fall short.
2024
Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?
Alexander Arno Weber | Klaudia Thellmann | Jan Ebert | Nicolas Flores-Herr | Jens Lehmann | Michael Fromm | Mehdi Ali
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The adaptation of multilingual pre-trained LLMs into eloquent and helpful assistants is essential to facilitate their use across different language regions. In that spirit, we are the first to conduct an extensive study of the performance of multilingual models instruction-tuned on different language compositions on parallel instruction-tuning benchmarks across a selection of the most spoken Indo-European languages. We systematically examine the effects of language and instruction dataset size on a mid-sized and a large multilingual LLM by instruction-tuning them on parallel instruction-tuning datasets. Our results demonstrate that instruction-tuning on parallel instead of monolingual corpora benefits cross-lingual instruction-following capabilities by up to 9.9%. Furthermore, we show that the Superficial Alignment Hypothesis does not hold in general, as the investigated multilingual 7B parameter model presents a counter-example requiring large-scale instruction-tuning datasets. Finally, we conduct a human annotation study to understand the alignment between human-based and GPT-4-based evaluation within multilingual chat scenarios.
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali | Michael Fromm | Klaudia Thellmann | Richard Rutmann | Max Lübbering | Johannes Leveling | Katrin Klug | Jan Ebert | Niclas Doll | Jasper Buschhoff | Charvi Jain | Alexander Weber | Lena Jurkschat | Hammam Abdelwahab | Chelsea John | Pedro Ortiz Suarez | Malte Ostendorff | Samuel Weinbach | Rafet Sifa | Stefan Kesselheim | Nicolas Flores-Herr
Findings of the Association for Computational Linguistics: NAACL 2024
The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model’s downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases by a factor of three in comparison to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
Co-authors
- Mehdi Ali 2
- Nicolas Flores-Herr 2
- Michael Fromm 2
- Klaudia Thellmann 2
- Hammam Abdelwahab 1
- Lars Bungum 1
- Jasper Buschhoff 1
- Niclas Doll 1
- Hafsteinn Einarsson 1
- Oleg Filatov 1
- Charvi Jain 1
- Chelsea John 1
- Lena Jurkschat 1
- Stefan Kesselheim 1
- Katrin Klug 1
- Jens Lehmann 1
- Johannes Leveling 1
- Max Lübbering 1
- Hemanadhan Myneni 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Morris Riedel 1
- Richard Rutmann 1
- Rafet Sifa 1
- Annika Simonsen 1
- Mathias Stenlund 1
- Jiangtao Wang 1
- Alexander Arno Weber 1
- Alexander Weber 1
- Samuel Weinbach 1