Yan Weiser


2025

While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model’s capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.

2024

This paper introduces a novel approach to form-filling and dialogue system evaluation by leveraging Large Language Models (LLMs). The proposed method establishes a setup wherein multiple modules collaborate on addressing the form-filling task. The dialogue system is constructed on top of LLMs, focusing on defining specific roles for individual modules. We show that using multiple independent sub-modules working cooperatively on this task can improve performance and handle the typical constraints of using LLMs, such as context limitations. The study involves testing the modular setup on four selected forms of varying topics and lengths, employing commercial and open-access LLMs. The experimental results demonstrate that the modular setup consistently outperforms the baseline, showcasing the effectiveness of this approach. Furthermore, our findings reveal that open-access models perform comparably to commercial models for the specified task.