A Subspace-Based Analysis of Structured and Unstructured Representations in Image-Text Retrieval

Erica K. Shimomoto; Edison Marrese-Taylor; Hiroya Takamura; Ichiro Kobayashi; Yusuke Miyao

Abstract

In this paper, we look specifically at the image-text retrieval problem. Recent multimodal frameworks have shown that structured inputs and fine-tuning lead to consistent performance improvements. However, this paradigm has recently been challenged by newer Transformer-based models that reach zero-shot state-of-the-art results despite not explicitly using structured data during pre-training. Since structured inputs and fine-tuning demand additional computational resources, we seek to better understand their role in image-text retrieval by analyzing visual and text representations extracted with three multimodal frameworks: SGM, UNITER, and CLIP. To perform this analysis, we represent each image or text as a low-dimensional linear subspace and perform retrieval based on subspace similarity. We chose this representation because subspaces give us the flexibility to model an entity based on feature sets, allowing us to observe how integrating or reducing information changes the representation of each entity. We analyze the performance of the selected models' features on two standard benchmark datasets. Our results indicate that heavily pre-training models can already lead to features with critical information representing each entity, with zero-shot UNITER features performing consistently better than fine-tuned features. Furthermore, while models can benefit from structured inputs, learning representations for objects and relationships separately, as in SGM, likely causes a loss of crucial contextual information needed to obtain a compact cluster that can effectively represent a single entity.
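The subspace representation and subspace-based retrieval described in the abstract can be illustrated with a minimal sketch. It assumes uncentered PCA (computed via SVD) to build each subspace and the mean squared cosine of the canonical angles as the similarity. The abstract specifies neither, so the function names, the `dim` parameter, and the similarity definition are illustrative assumptions, not necessarily the authors' procedure.

```python
import numpy as np


def subspace_basis(features, dim):
    """Orthonormal basis (feat_dim, dim) spanning a set of feature vectors.

    features: array of shape (n_items, feat_dim), e.g. region or token features.
    Assumption: uncentered PCA, i.e. the top left singular vectors of features.T.
    """
    u, _, _ = np.linalg.svd(np.asarray(features, dtype=float).T,
                            full_matrices=False)
    return u[:, :dim]


def subspace_similarity(basis_a, basis_b):
    """Mean squared cosine of the canonical angles between two subspaces.

    The singular values of basis_a.T @ basis_b are the cosines of the
    canonical angles when both bases are orthonormal.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding error
    return float(np.mean(cosines ** 2))


def retrieve(query_basis, candidate_bases):
    """Return the index of the most similar candidate and all scores."""
    scores = [subspace_similarity(query_basis, c) for c in candidate_bases]
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    # Hypothetical data: random vectors stand in for model features.
    rng = np.random.default_rng(0)
    text = subspace_basis(rng.normal(size=(12, 64)), dim=3)
    images = [subspace_basis(rng.normal(size=(36, 64)), dim=3)
              for _ in range(5)]
    best, scores = retrieve(text, images)
    print(best, scores)
```

Because a subspace is built from a set of vectors, adding or dropping features (for example, object features with or without relationship features) only changes the rows passed to `subspace_basis`, which is what makes this representation convenient for the kind of analysis the abstract describes.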

Anthology ID: 2022.umios-1.4
Volume: Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates (Hybrid)
Venue: UM-IoS
Publisher: Association for Computational Linguistics
Pages: 29–44
URL: https://aclanthology.org/2022.umios-1.4
Cite (ACL): Erica K. Shimomoto, Edison Marrese-Taylor, Hiroya Takamura, Ichiro Kobayashi, and Yusuke Miyao. 2022. A Subspace-Based Analysis of Structured and Unstructured Representations in Image-Text Retrieval. In Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS), pages 29–44, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal): A Subspace-Based Analysis of Structured and Unstructured Representations in Image-Text Retrieval (Shimomoto et al., UM-IoS 2022)
PDF: https://preview.aclanthology.org/paclic-22-ingestion/2022.umios-1.4.pdf
Video: https://preview.aclanthology.org/paclic-22-ingestion/2022.umios-1.4.mp4