Jacob Nielsen
2026
SommBench: Assessing Sommelier Expertise of Language Models
William Brach | Tomas Bedej | Jacob Nielsen | Jacob Pichna | Juraj Bedej | Eemeli Saarensilta | Julie Dupouy | Gianluca Barmina | Andrea Blasi Núñez | Peter Schneider-Kamp | Kristian Košťál | Michal Ries | Lukas Galke Poech
Proceedings of the Fifteenth Language Resources and Evaluation Conference
With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model’s wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 questions for wine theory question answering, 1,000 examples for wine feature completion, and 1,000 examples of food-wine pairing. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
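For readers unfamiliar with the MCC figure quoted for food-wine pairing, the following is a minimal sketch of the Matthews correlation coefficient for binary labels. It assumes each pairing is judged as suitable (1) or unsuitable (0); the abstract does not state the label format, and the function name `mcc` is illustrative rather than taken from the SommBench repository.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).

    Assumption: labels are binary; the benchmark's actual label
    format is not given in the abstract.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

An MCC of 1 means perfect agreement, 0 means no better than chance, and -1 means complete disagreement, so a range of 0 to 0.39 indicates weak to moderate agreement with the reference pairings.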
Dynaword: From One-shot to Continuously Developed Datasets
Kenneth Enevoldsen | Kristian Nørgaard Jensen | Jan Kostkan | Balázs Szabó | Márton Kardos | Kirsten Vad | Johan Heinsen | Andrea Blasi Núñez | Gianluca Barmina | Jacob Nielsen | Rasmus Larsen | Rob van der Goot | Peter Vahlstrup | Per Møldrup Dalum | Desmond Elliott | Lukas Galke Poech | Peter Schneider-Kamp | Kristoffer Nielbo
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over five times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry, the public sector and research institutions. The repository includes light-weight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.
2025
Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?
Jacob Nielsen | Peter Schneider-Kamp | Lukas Galke
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training, where the models are first trained with 16-bit precision and then transition into 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable to full 1.58-bit training and leaves models closer to those which have undergone 16-bit training. We further investigate the effects of retaining the optimizer state at the transition point and gradually phasing in quantization strength, finding that both techniques alleviate the magnitude of loss spikes, but also that these effects can be compensated through further training.
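The idea of phasing in quantization strength can be illustrated with a small sketch. It assumes the absmean ternary quantizer commonly associated with BitNet b1.58 (weights scaled by their mean absolute value, then rounded and clipped to -1, 0, or 1) and a hypothetical linear schedule `lam_at`; the abstract does not specify the exact quantizer or schedule, so all names and the linear ramp are assumptions for illustration only.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a list of floats to {-1, 0, 1} using an absmean scale.

    Returns the ternary values and the scale gamma = mean(|w|).
    """
    gamma = sum(abs(w) for w in weights) / len(weights)
    scale = gamma + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, gamma


def lam_at(step, transition_step, ramp_steps):
    """Hypothetical linear schedule for the quantization strength.

    0.0 before the transition (pure 16-bit phase), then a linear
    ramp to 1.0 (full quantization) over ramp_steps steps.
    """
    if step < transition_step:
        return 0.0
    if ramp_steps <= 0:
        return 1.0
    return min(1.0, (step - transition_step) / ramp_steps)


def phased_weights(weights, lam):
    """Blend full-precision and dequantized ternary weights.

    lam = 0.0 keeps the original weights; lam = 1.0 uses the fully
    quantized weights. This interpolation is an assumed form of
    'gradually phasing in quantization strength'.
    """
    ternary, gamma = absmean_ternary(weights)
    return [(1.0 - lam) * w + lam * gamma * q for w, q in zip(weights, ternary)]
```

In an actual training loop the forward pass would use the blended weights while gradients flow to the full-precision copy (a straight-through estimator); that part is omitted here to keep the sketch dependency-free.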
Co-authors
- Lukas Galke Poech 3
- Peter Schneider-Kamp 3
- Gianluca Barmina 2
- Andrea Blasi Núñez 2
- Tomas Bedej 1
- Juraj Bedej 1
- William Brach 1
- Per Møldrup Dalum 1
- Julie Dupouy 1
- Desmond Elliott 1
- Kenneth Enevoldsen 1
- Rob van der Goot 1
- Johan Heinsen 1
- Kristian Nørgaard Jensen 1
- Márton Kardos 1
- Jan Kostkan 1
- Kristian Košťál 1
- Rasmus Larsen 1
- Kristoffer Nielbo 1
- Jacob Pichna 1
- Michal Ries 1
- Eemeli Saarensilta 1
- Balázs Szabó 1
- Kirsten Vad 1
- Peter Vahlstrup 1