Vision-aided Unsupervised Constituency Parsing with Multi-MLLM Debating

Dong Zhang, Haiyan Tian, Qingying Sun, Shoushan Li


Abstract
This paper presents a novel framework for vision-aided unsupervised constituency parsing (VUCP), leveraging multimodal large language models (MLLMs) pre-trained on diverse image-text or video-text data. Unlike previous methods requiring explicit cross-modal alignment, our approach eliminates this need by using pre-trained models like Qwen-VL and VideoLLaVA, which seamlessly handle multimodal inputs. We introduce two multi-agent debating mechanisms—consensus-driven (CD) and round-driven (RD)—to enable cooperation between models with complementary strengths. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on both image-text and video-text datasets for VUCP, improving robustness and accuracy.
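The abstract names the consensus-driven (CD) debating mechanism without detailing it, so the following is a minimal, hypothetical sketch of how a CD debate loop between two MLLM-backed parsers could be organized. The function `consensus_debate` and the toy agents `agent_a`/`agent_b` are illustrative stand-ins (real agents would prompt models such as Qwen-VL or VideoLLaVA), not the authors' implementation.

```python
from typing import Callable, List

# A constituency parse is represented as a frozenset of (start, end) token spans.
Parse = frozenset

def consensus_debate(
    agents: List[Callable[[List[str], List[Parse]], Parse]],
    tokens: List[str],
    max_rounds: int = 3,
) -> Parse:
    """Consensus-driven (CD) debate sketch: each agent proposes a bracketing,
    sees its peers' latest proposals, and revises until all proposals
    coincide or the round budget is exhausted."""
    proposals = [agent(tokens, []) for agent in agents]  # round 0: independent parses
    for _ in range(max_rounds):
        if len(set(proposals)) == 1:  # full consensus reached
            return proposals[0]
        # each agent revises its parse conditioned on the peers' proposals
        proposals = [
            agent(tokens, [p for j, p in enumerate(proposals) if j != i])
            for i, agent in enumerate(agents)
        ]
    # no consensus: fall back to spans endorsed by a majority of agents
    votes = {}
    for parse in proposals:
        for span in parse:
            votes[span] = votes.get(span, 0) + 1
    return frozenset(s for s, v in votes.items() if v > len(agents) // 2)

# Toy stand-ins for MLLM-backed parsers (hypothetical, for illustration only).
def agent_a(tokens, peer_parses):
    return frozenset({(0, len(tokens)), (0, 2)})

def agent_b(tokens, peer_parses):
    # defers to the first peer proposal if one exists, else proposes its own
    return peer_parses[0] if peer_parses else frozenset({(0, len(tokens)), (1, len(tokens))})

if __name__ == "__main__":
    print(consensus_debate([agent_a, agent_b], "a cat sat on the mat".split()))
```

A round-driven (RD) variant would instead fix the number of debate rounds and aggregate the final proposals, rather than stopping early on agreement; the fallback voting step above hints at one way such aggregation could work.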
Anthology ID: 2025.findings-acl.353
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6800–6810
URL: https://preview.aclanthology.org/transition-to-people-yaml/2025.findings-acl.353/
DOI: 10.18653/v1/2025.findings-acl.353
Cite (ACL): Dong Zhang, Haiyan Tian, Qingying Sun, and Shoushan Li. 2025. Vision-aided Unsupervised Constituency Parsing with Multi-MLLM Debating. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6800–6810, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Vision-aided Unsupervised Constituency Parsing with Multi-MLLM Debating (Zhang et al., Findings 2025)
PDF: https://preview.aclanthology.org/transition-to-people-yaml/2025.findings-acl.353.pdf