DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding

Kin Ian Lo, Hala Hawashin, Mina Abbaszadeh, Tilen Gaetano Limbäck-Stokin, Hadi Wazni, Mehrnoosh Sadrzadeh


Abstract
Recent vision–language models excel at large-scale image–text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate–argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence’s grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP’s SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision–language tasks.
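For readers unfamiliar with the composition the abstract describes, below is a minimal, self-contained sketch (not the authors' released code) of the core idea: a transitive verb represented as an order-3 tensor, the sentence vector for "subject verb object" obtained by contracting that tensor with the two noun vectors as the grammatical derivation dictates, and a CP (rank-R) factorization that replaces the d^3 verb parameters with 3·R·d. The embedding dimension, rank, and variable names here are illustrative assumptions, not values taken from the paper.

# Minimal sketch of DisCoCat-style tensor composition with a CP-factorized verb.
# All dimensions, the rank, and the names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, R = 64, 8                                   # embedding dim and CP rank (assumed)

subj = rng.standard_normal(d)                  # noun vectors
obj = rng.standard_normal(d)
verb = rng.standard_normal((d, d, d))          # full verb tensor: d**3 = 262,144 params

# Sentence meaning: contract the verb with its arguments, mirroring the
# derivation of "subject verb object". Index order: (subject, output, object).
sent_full = np.einsum("sio,s,o->i", verb, subj, obj)

# CP factorization: verb[s,i,o] ≈ sum_r A[r,s] * B[r,i] * C[r,o],
# i.e. 3 * R * d = 1,536 parameters instead of 262,144.
A = rng.standard_normal((R, d))                # subject-slot factors
B = rng.standard_normal((R, d))                # sentence-output factors
C = rng.standard_normal((R, d))                # object-slot factors

# The factorized contraction never materializes the d x d x d tensor.
sent_cp = np.einsum("rs,s,ri,ro,o->i", A, subj, B, C, obj)

# Sanity check: rebuilding the CP tensor and contracting it gives the same vector.
verb_cp = np.einsum("rs,ri,ro->sio", A, B, C)
assert np.allclose(np.einsum("sio,s,o->i", verb_cp, subj, obj), sent_cp)

print(sent_full.shape, sent_cp.shape)          # (64,) (64,)

At realistic embedding widths a single full verb tensor alone would hold tens of millions of parameters, so a factorization of this kind is what makes the abstract's reported reduction to under one million parameters plausible; the specific decomposition used in the paper may differ from the CP form sketched here.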
Anthology ID: 2025.starsem-1.25
Volume: Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
Month: November
Year: 2025
Address: Suzhou, China
Editors: Lea Frermann, Mark Stevenson
Venue: *SEM
Publisher: Association for Computational Linguistics
Pages: 316–327
URL: https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.25/
Cite (ACL): Kin Ian Lo, Hala Hawashin, Mina Abbaszadeh, Tilen Gaetano Limbäck-Stokin, Hadi Wazni, and Mehrnoosh Sadrzadeh. 2025. DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding. In Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025), pages 316–327, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding (Lo et al., *SEM 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.25.pdf