Praboda Rajapaksha


2025

SA-CLIP: Language Guided Image Spatial and Action Feature Learning
Guanlin Li | Wenhao Shao | Praboda Rajapaksha | Noel Crespi
Findings of the Association for Computational Linguistics: EMNLP 2025

We observe that Contrastive Language-Image Pretraining (CLIP) models struggle with real-world downstream tasks such as road traffic anomaly detection because they fail to effectively capture spatial and action relationships between objects within images. To address this, we compile and curate a dataset of 1M image samples using language supervision from a common image-caption dataset, in which each image is paired with subject-relationship-object descriptions emphasizing spatial and action interactions, and use it to train a Spatial and Action relationship aware CLIP (SA-CLIP) model. We evaluate the proposed model on the Visual Spatial Reasoning (VSR) dataset and further verify its effectiveness on the Detection-of-Traffic-Anomaly (DoTA) dataset. Experimental results show that SA-CLIP demonstrates a strong ability to understand spatial relationships while achieving good zero-shot performance on the traffic anomaly detection task.
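
To make the zero-shot anomaly-detection setup concrete, the sketch below shows how a CLIP-style model can score a traffic frame against subject-relationship-object style prompts. It is a minimal illustration, not the paper's implementation: it uses a standard public CLIP checkpoint in place of SA-CLIP, and the prompt templates and the image path are assumptions chosen for the example.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint; the paper's SA-CLIP weights are not assumed to be available here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prompts phrased as subject-relationship-object descriptions emphasizing
# spatial and action relations, in the spirit of the abstract (illustrative only).
prompts = [
    "a car driving normally in its lane",
    "a car colliding with another vehicle",
    "a pedestrian crossing in front of a moving car",
]

image = Image.open("frame.jpg")  # placeholder path to a single traffic-scene frame

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned temperature;
# softmax over prompts gives a zero-shot score for each candidate description.
probs = outputs.logits_per_image.softmax(dim=-1)
for prompt, score in zip(prompts, probs[0].tolist()):
    print(f"{score:.3f}  {prompt}")
```

A frame whose highest-scoring prompt describes a collision or other abnormal interaction would be flagged as anomalous; the abstract's contribution is that pretraining on relationship-focused captions makes such spatial and action distinctions more reliable than with a vanilla CLIP model.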