Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Wei Li; Zhen Huang; Xinmei Tian

Abstract

Contrastively trained vision-language models (VLMs) such as CLIP have made remarkable progress in learning joint image-text representations, but they still struggle with compositional understanding. They often exhibit “bag-of-words” behavior, failing to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose **MACCO** (**MA**sked **C**ompositional **C**oncept M**O**deling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models.
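To make the “bag-of-words” failure and the idea of masking concepts concrete, the following is a minimal toy sketch at the level of caption tokens. It is not the MACCO implementation: the abstract describes masking and reconstructing features across modalities, whereas this example only manipulates strings. The function names `bag_of_words` and `mask_concepts`, the `[MASK]` token, and the example captions are all assumptions introduced for illustration.

```python
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Order-free representation: word counts only (illustrative stand-in)."""
    return Counter(caption.lower().split())


def mask_concepts(caption: str, concepts: list[str], mask_token: str = "[MASK]") -> str:
    """Replace every token that matches a listed concept with a mask token.

    Hypothetical helper: a real system would have to detect concepts
    (attributes, objects, relations) itself and reconstruct them from
    the paired image rather than from a hand-written list.
    """
    targets = {c.lower() for c in concepts}
    tokens = caption.split()
    return " ".join(mask_token if t.lower() in targets else t for t in tokens)


# Two captions that swap attribute-object bindings.
a = "a black cat on a white sofa"
b = "a white cat on a black sofa"

# An order-free view cannot tell the two captions apart.
same_bag = bag_of_words(a) == bag_of_words(b)

# Masking the attribute concepts leaves slots that the text alone cannot
# fill unambiguously; the paired image would be needed to recover them.
masked = mask_concepts(a, ["black", "white"])

print(same_bag)
print(masked)
```

Both captions yield the same masked string here, which is the point of the toy: only information from the other modality can resolve which attribute belongs to which object.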

Anthology ID: 2026.acl-long.1490
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 32284–32308
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1490/
Cite (ACL): Wei Li, Zhen Huang, and Xinmei Tian. 2026. Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32284–32308, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality (Li et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1490.pdf
Checklist: 2026.acl-long.1490.checklist.pdf
