Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

Bram Willemsen, Gabriel Skantze


Abstract
In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
Anthology ID:
2025.xllm-1.6
Volume:
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Hao Fei, Kewei Tu, Yuhui Zhang, Xiang Hu, Wenjuan Han, Zixia Jia, Zilong Zheng, Yixin Cao, Meishan Zhang, Wei Lu, N. Siddharth, Lilja Øvrelid, Nianwen Xue, Yue Zhang
Venues:
XLLM | WS
Publisher:
Association for Computational Linguistics
Pages:
49–60
URL:
https://preview.aclanthology.org/landing_page/2025.xllm-1.6/
Cite (ACL):
Bram Willemsen and Gabriel Skantze. 2025. Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025), pages 49–60, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models (Willemsen & Skantze, XLLM 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.xllm-1.6.pdf