Manar Ali
2026
Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests
Manar Ali | Judith Sieker | Sina Zarrieß | Hendrik Buschmeier
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
In human conversation, both interlocutors play an active role in maintaining mutual understanding. When listeners are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar listener role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a suitable testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models, comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.
2025
Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?
Simeon Junker | Manar Ali | Larissa Koch | Sina Zarrieß | Hendrik Buschmeier
Findings of the Association for Computational Linguistics: ACL 2025
We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today’s language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.
Towards Neuro-Symbolic Approaches for Referring Expression Generation
Manar Ali | Marika Sarzotti | Simeon Junker | Hendrik Buschmeier | Sina Zarrieß
Proceedings of the 2025 CLASP Conference on Language models And RePresentations (LARP)
Referring Expression Generation (REG) has a long-standing tradition in computational linguistics, and often aims to develop cognitively plausible models of language generation and dialogue modeling, in a multimodal context. Traditional approaches to reference have been mostly symbolic, whereas recent ones have been mostly neural. Inspired by the recent interest in neuro-symbolic approaches in both fields – language and vision – we revisit REG from these perspectives. We review relevant neuro-symbolic approaches to language generation on the one hand and vision on the other hand, exploring possible future directions for cognitively plausible models of reference generation/reference game modeling.
Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
Francesca Padovani | Bastian Bunzeck | Manar Ali | Omar Momen | Arianna Bisazza | Hendrik Buschmeier | Sina Zarrieß
Proceedings of the First BabyLM Workshop
We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce “more communicative” text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.