Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva


Abstract
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.
Anthology ID:
2026.findings-acl.2131
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
42994–43008
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2131/
Cite (ACL):
Israfel Salazar, Desmond Elliott, and Yova Kementchedjhieva. 2026. Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42994–43008, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs (Salazar et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2131.pdf
Checklist:
2026.findings-acl.2131.checklist.pdf