Julia Wunderle
2026
New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models
Julia Wunderle | Anton Ehrmanntraut | Jan Pfister | Fotis Jannidis | Andreas Hotho
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: a) training from scratch and b) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LLäMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing a context extension to 8192 tokens. Across SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERTlarge (+4%) and the seven-times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. Overall, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate. When a pre-trained decoder exists and compute is limited, conversion offers an effective alternative.
2025
Die SuperGLEBer at GermEval 2025 Shared Tasks: Growing Pains - When More Isn’t Always Better
Julia Wunderle | Jan Pfister | Andreas Hotho
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops
LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch
Jan Pfister | Julia Wunderle | Andreas Hotho
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We transparently create two German-only decoder models, LLäMmlein 120M and 1B, from scratch and publish them, along with the training data, for the (German) NLP research community to use. The model training involved several key steps, including data preprocessing/filtering, the creation of a German tokenizer, the training itself, and the evaluation of the final models on various benchmarks, also against existing models. Throughout the training process, multiple checkpoints were saved at equal intervals and analyzed using the German SuperGLEBer benchmark to gain insights into the models’ learning process. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models’ quality scales with size as expected, but performance improvements on some tasks plateaued early during training, offering valuable insights into resource allocation for future models.
2024
OtterlyObsessedWithSemantics at SemEval-2024 Task 4: Developing a Hierarchical Multi-Label Classification Head for Large Language Models
Julia Wunderle | Julian Schubert | Antonella Cacciatore | Albin Zehe | Jan Pfister | Andreas Hotho
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
For our submission for Subtask 1, we developed a custom classification head that is designed to be applied atop a Large Language Model. We reconstructed the hierarchy across multiple fully connected layers, allowing us to incorporate previous foundational decisions in subsequent, more fine-grained layers. To find the best hyperparameters, we conducted a grid search, and to compete in the multilingual setting, we translated all documents to English.