Weiling Li
2026
LVLMs and Humans Ground Differently in Referential Communication
Peter Zeng | Weiling Li | Amie J. Paige | Zhengxiang Wang | Panagiotis Kaliosis | Dimitris Samaras | Gregory J. Zelinsky | Susan Brennan | Owen Rambow
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. We present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We show that LVLMs cannot interactively generate and resolve referring expressions in a way that enables smooth communication, a crucial skill that underlies human language use. We release our corpus of 356 dialogues (89 pairs over 4 rounds each) along with the online pipeline for data collection and the tools for analyzing accuracy, efficiency, and lexical overlap.
2025
LVLMs are Bad at Overhearing Human Referential Communication
Zhengxiang Wang | Weiling Li | Panagiotis Kaliosis | Owen Rambow | Susan Brennan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.