DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding
Kin Ian Lo, Hala Hawashin, Mina Abbaszadeh, Tilen Gaetano Limbäck-Stokin, Hadi Wazni, Mehrnoosh Sadrzadeh
Abstract
Recent vision–language models excel at large-scale image–text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate–argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence’s grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP’s SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision–language tasks.
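The abstract describes transitive-verb meanings as higher-order tensors that are contracted with their argument vectors following the grammatical derivation, and factorized to stay parameter-efficient. The sketch below illustrates that idea for a subject-verb-object sentence using a CP-style rank factorization; the dimensions, the choice of CP decomposition, and all variable names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

# Illustrative dimensions (not taken from the paper).
d_noun, d_sent, rank = 64, 64, 16
rng = np.random.default_rng(0)

# Distributional word vectors for the subject and object (noun space N).
subj = rng.normal(size=d_noun)   # e.g. "dog"
obj = rng.normal(size=d_noun)    # e.g. "ball"

# A transitive verb lives in N (x) S (x) N. Instead of storing the full
# d_noun * d_sent * d_noun tensor, keep a rank-`rank` CP factorization:
#   V[i, j, k] ~= sum_r A[i, r] * B[j, r] * C[k, r]
A = rng.normal(size=(d_noun, rank))
B = rng.normal(size=(d_sent, rank))
C = rng.normal(size=(d_noun, rank))

# Sentence meaning for "subj verb obj": contract the verb tensor with the
# subject and object vectors, mirroring the syntactic derivation. With the
# CP factors the full third-order tensor is never materialized.
sentence = B @ ((subj @ A) * (obj @ C))   # shape: (d_sent,)

# Swapping subject and object generally yields a different sentence vector,
# which is what makes such an encoder sensitive to word order.
swapped = B @ ((obj @ A) * (subj @ C))
print(np.allclose(sentence, swapped))     # typically False
```

At these toy sizes the factorized verb stores 3 × 64 × 16 = 3,072 parameters instead of 64³ = 262,144 for the full tensor, mirroring at small scale the kind of reduction the abstract reports.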
- Anthology ID:
- 2025.starsem-1.25
- Volume:
- Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Lea Frermann, Mark Stevenson
- Venue:
- *SEM
- Publisher:
- Association for Computational Linguistics
- Pages:
- 316–327
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.25/
- Cite (ACL):
- Kin Ian Lo, Hala Hawashin, Mina Abbaszadeh, Tilen Gaetano Limbäck-Stokin, Hadi Wazni, and Mehrnoosh Sadrzadeh. 2025. DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding. In Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025), pages 316–327, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding (Lo et al., *SEM 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.25.pdf