Andrea Blasi Núñez

2026

With the rapid advances in large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in eight languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model’s wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 questions for wine theory question answering, 1,000 examples for wine feature completion, and 1,000 examples for food-wine pairing. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5 and open-weights models such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
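For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) reported for food-wine pairing can be computed from a binary confusion matrix. The sketch below assumes that a pairing is scored as a binary suitable/unsuitable decision, which the abstract does not state. The function name `mcc` and the toy labels are illustrative only and are not taken from the SommBench code.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).

    Assumption: labels are binary. Returns 0.0 when the denominator is
    zero (for example, when every prediction is the same class), which is
    a common convention.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    gold = [1, 1, 0, 0]           # hypothetical gold labels
    print(mcc(gold, gold))        # perfect agreement
    print(mcc(gold, [1, 0, 1, 0]))  # chance-level agreement
    print(mcc(gold, [1, 1, 1, 1]))  # constant prediction
```

An MCC of 1 indicates perfect agreement, 0 is what a chance-level or constant predictor would obtain, and -1 indicates total disagreement, so the reported range of 0 to 0.39 sits well below the theory-question accuracy.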

Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources, which restricts use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over five times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions from industry, the public sector, and research institutions. The repository includes lightweight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.

2025

MLDataForge: Accelerating Large-Scale Dataset Preprocessing and Access for Multimodal Foundation Model Training
Andrea Blasi Núñez | Lukas Paul Achatius Galke | Peter Schneider-Kamp
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Preprocessing large and possibly multimodal datasets remains a key bottleneck in many machine learning workflows, particularly when random access to samples is needed for global shuffling and sorting. Existing approaches, including widely used formats like JSONL and frameworks such as Hugging Face Datasets and MosaicML Streaming, typically incur substantial computational, memory, and storage overhead in such settings. Here, we introduce MLDataForge, a Python-based open-source framework designed for scalable dataset preprocessing and access. Our key contributions are: (1) optimized readers for Mosaic Data Shards (MDS) that substantially improve throughput, reduce peak storage usage, and support sample-level compression; (2) JINX (JSON Indexed ’N’ eXtended), a novel, index-augmented JSONL-compatible format supporting structured footers and binary sidecar files; and (3) a lazy-loading mechanism that defers loading, decompression, and decoding of JINX data until sample fields are accessed. We empirically evaluate MLDataForge and our contributions on a representative 200 GB supervised fine-tuning dataset for vision language models. Our best configuration (zstd-compressed JINX with binary sidecar and lazy loading) yields at least an order-of-magnitude throughput increase over the best baselines for iteration, global shuffling, and sorting. These advances enable substantial gains in data preprocessing performance, facilitating more scalable and resource-efficient model training pipelines.
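To illustrate why an index helps with random access and why lazy loading saves work, the following is a minimal sketch using plain JSONL and a byte-offset index kept in memory. It is not the JINX format and does not use the MLDataForge API: the names `build_offsets`, `LazySample`, and `shuffled_samples` are hypothetical, and compression, footers, and sidecar files are left out entirely.

```python
import json
import random


def build_offsets(path):
    """Scan the file once and record the byte offset of each non-empty line."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            if line.strip():
                offsets.append(pos)
            pos += len(line)
    return offsets


class LazySample:
    """Holds only a file path and an offset; parses the record on first field access."""

    def __init__(self, path, offset):
        self._path = path
        self._offset = offset
        self._data = None  # nothing is read or decoded yet

    def __getitem__(self, key):
        if self._data is None:
            with open(self._path, "rb") as f:
                f.seek(self._offset)  # jump straight to the record
                self._data = json.loads(f.readline())
        return self._data[key]


def shuffled_samples(path, offsets, seed=0):
    """Global shuffle over offsets only; no record is read until it is used."""
    order = list(offsets)
    random.Random(seed).shuffle(order)
    return [LazySample(path, off) for off in order]
```

With an index of this kind, shuffling or sorting touches only the list of offsets, and a record is read and decoded only when one of its fields is requested. A production design would also reuse file handles and persist the index rather than rebuilding it on every run.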