Sebastian Houben


2026

Vision-Language Models (VLMs) have achieved impressive progress across diverse multimodal tasks, yet their ability to interpret structured diagrams, such as circuit schematics, molecular structures, musical notation, business process flow charts or class diagrams, which are central to scientific and engineering communication, remains underexplored. We introduce STRUDEL (STRUctured Diagram EvaLuation), a benchmark for evaluating VLMs on structured diagram understanding across 8 domains and 20 image categories. STRUDEL leverages Large-Language Models (LLMs) to synthesize code in domain-specific formal representation languages (FRLs) (e.g. circuit netlists, SMILES, ABC-Notation, BPMN or PlantUML), which are rendered into valid diagrams and paired with generated tasks, functional descriptions, and captions. A multi-stage pipeline filters invalid, cluttered, or redundant samples and employs LLM-as-a-judge scoring to ensure correctness. Through targeted experiments, we evaluate the ability of LLMs to generate valid code in distinct FRLs, demonstrating their capability to successfully perform this task. The resulting benchmark comprises diverse task types covering identification, quantification, structural analysis, image-text association, and image-to-code translation. Evaluating 35 VLMs using STRUDEL reveals that models excel at association tasks, demonstrating strong visual-textual alignment, yet struggle with quantification and identification, where precise structural understanding is required. Performance varies markedly in image-to-code translation, reflecting significant differences in how models connect visual inputs to formal representations. Overall, STRUDEL establishes a scalable foundation for assessing and advancing VLMs torward deeper and more systematic understanding of structured visual information across domains.

2025

Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model’s predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.
To ensure reliable performance of Question Answering (QA) systems, evaluation of robustness is crucial. Common evaluation benchmarks commonly only include performance metrics, such as Exact Match (EM) and the F1 score. However, these benchmarks overlook critical factors for the deployment of QA systems. This oversight can result in systems vulnerable to minor perturbations in the input such as typographical errors. While several methods have been proposed to test the robustness of QA models, there has been minimal exploration of these approaches for languages other than English. This study focuses on the robustness evaluation of German language QA models, extending methodologies previously applied primarily to English. The objective is to nurture the development of robust models by defining an evaluation method specifically tailored to the German language. We assess the applicability of perturbations used in English QA models for German and perform a comprehensive experimental evaluation with eight models. The results show that all models are vulnerable to character-level perturbations. Additionally, the comparison of monolingual and multilingual models suggest that the former are less affected by character and word-level perturbations.