CF-TCIR: A Compositor-Free Framework for Hierarchical Text-Conditioned Image Retrieval

Yuchen Yang, Yu Wang, Yanfeng Wang


Abstract
In text-conditioned image retrieval (TCIR), a reference image and a modification text together form a query tuple, and the goal is to locate the most congruent target image in a dataset. This setup combines the rich semantic information of images with the flexibility of text for more accurate retrieval. Traditional techniques often employ attention-driven compositors to build a unified image-text representation. Such compositor-based methods are designed to learn a joint representation of images and text, but they struggle to directly capture the correlations between attributes across the two modalities. Our paper instead introduces a compositor-free framework, CF-TCIR, which reformulates the retrieval process as a cross-modal interaction between a synthesized image feature and its corresponding text descriptor. This formulation offers advantages in computational efficiency, scalability, and retrieval performance. To optimize retrieval performance, we advocate a tiered retrieval mechanism that blends coarse-grained and fine-grained paradigms. Moreover, to enrich the contextual relationship within the query tuple, we integrate a generative cross-modal alignment technique that ensures synchronization of sequential attributes between image and text data.
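As a rough illustration of what a tiered, coarse-to-fine retrieval pass can look like, the sketch below shortlists candidates with a cheap global similarity and then re-ranks the shortlist with a finer token-level score. It is not the CF-TCIR implementation: the abstract does not specify the scoring functions, so cosine similarity, the max-over-tokens fine score, the function names, and the `"global"`/`"tokens"` fields are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def coarse_stage(query_global, gallery, k):
    """Coarse pass (assumed): keep the k candidates whose global vector
    is most similar to the query's global vector."""
    scored = [(cosine(query_global, item["global"]), idx)
              for idx, item in enumerate(gallery)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

def fine_score(query_tokens, cand_tokens):
    """Fine pass (assumed): for each query token take its best-matching
    candidate token, then average over query tokens."""
    best = [max(cosine(q, c) for c in cand_tokens) for q in query_tokens]
    return sum(best) / len(best)

def hierarchical_retrieve(query_global, query_tokens, gallery, k=2):
    """Shortlist with the coarse score, then re-rank the shortlist with the fine score."""
    shortlist = coarse_stage(query_global, gallery, k)
    return sorted(shortlist,
                  key=lambda idx: fine_score(query_tokens, gallery[idx]["tokens"]),
                  reverse=True)

# Hypothetical toy gallery: each item has one global vector and a few token vectors.
gallery = [
    {"global": [1.0, 0.0], "tokens": [[1.0, 0.0], [1.0, 0.0]]},
    {"global": [0.9, 0.1], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"global": [0.0, 1.0], "tokens": [[0.0, 1.0]]},
]
query_global = [1.0, 0.0]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

ranking = hierarchical_retrieve(query_global, query_tokens, gallery, k=2)
print(ranking)
```

In this toy setup the coarse pass discards the third item, and the fine pass can reorder the remaining two because it looks at individual token matches rather than a single global vector. The expensive score is only computed for `k` candidates, which is the usual motivation for a two-stage design.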
Anthology ID:
2024.findings-acl.965
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16315–16325
URL:
https://aclanthology.org/2024.findings-acl.965
DOI:
10.18653/v1/2024.findings-acl.965
Cite (ACL):
Yuchen Yang, Yu Wang, and Yanfeng Wang. 2024. CF-TCIR: A Compositor-Free Framework for Hierarchical Text-Conditioned Image Retrieval. In Findings of the Association for Computational Linguistics: ACL 2024, pages 16315–16325, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
CF-TCIR: A Compositor-Free Framework for Hierarchical Text-Conditioned Image Retrieval (Yang et al., Findings 2024)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.findings-acl.965.pdf