@inproceedings{li-etal-2025-sa,
title = "{SA}-{CLIP}: Language Guided Image Spatial and Action Feature Learning",
author = "Li, Guanlin and
Shao, Wenhao and
Rajapaksha, Praboda and
Crespi, Noel",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1134/",
doi = "10.18653/v1/2025.findings-emnlp.1134",
pages = "20808--20814",
ISBN = "979-8-89176-335-7",
abstract = "We observed that Contrastive Language-Image Pretraining (CLIP) models struggle with real-world downstream tasks such as road traffic anomaly detection, due to their inability to effectively capture spatial and action relationships between objects within images. To address this, we compile and curate a dataset with 1M samples of images using language supervision provided by the common image caption dataset, in which each image is paired with subject-relationship-object descriptions emphasizing spatial and action interactions, and train a \textbf{S}patial and \textbf{A}ction relationship aware \textbf{CLIP} (\textbf{SA-CLIP}) model. We evaluated the proposed model on the Visual Spatial Reasoning (VSR) dataset and further verified its effectiveness on the Detection-of-Traffic-Anomaly (DoTA) dataset. Experiment results show that the proposed SA-CLIP demonstrates strong abilities in understanding spatial relationships while achieving good zero-shot performance on the traffic anomaly detection task."
}
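
The abstract describes zero-shot traffic anomaly detection by scoring an image against relation-focused (subject-relationship-object) text prompts with a CLIP-style model. Below is a minimal sketch of that kind of zero-shot prompt scoring, assuming the stock Hugging Face "openai/clip-vit-base-patch32" checkpoint as a stand-in; the SA-CLIP weights are not referenced here, and the prompt wording and file names are illustrative, not the authors' released code.

```python
# Minimal sketch: zero-shot scoring of an image against relation-style prompts
# with a CLIP model (a stand-in for SA-CLIP; checkpoint, prompts, and file
# names below are assumptions for illustration only).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.jpg")  # hypothetical dashcam frame

# Subject-relationship-object style prompts emphasizing spatial/action relations.
prompts = [
    "a car driving in its lane behind another car",               # normal
    "a car colliding with a pedestrian on the road",              # anomalous
    "a vehicle swerving across the road into oncoming traffic",   # anomalous
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_prompts)
probs = logits_per_image.softmax(dim=-1)[0]

for prompt, p in zip(prompts, probs.tolist()):
    print(f"{p:.3f}  {prompt}")
```

In a setup like the one the abstract outlines, the probability mass assigned to anomaly-describing prompts versus normal-driving prompts could serve as a zero-shot anomaly score per frame.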