Abstract
In text-conditioned image retrieval (TCIR), the combination of a reference image and modification text forms a query tuple, aiming to locate the most congruent target image within a dataset. The advantages of rich image semantic information and text flexibility are combined in this manner for more accurate retrieval. While traditional techniques often employ attention-driven compositors to craft a unified image-text representation, our paper introduces a compositor-free framework, CF-TCIR, which eschews the standard compositor. Compositor-based methods are designed to learn a joint representation of images and text, but they struggle to directly capture the correlations between attributes across the image and text modalities. Instead, we reformulate the retrieval process as a cross-modal interaction between a synthesized image feature and its corresponding text descriptor. This novel methodology offers advantages in terms of computational efficiency, scalability, and superior performance. To optimize the retrieval performance, we advocate a tiered retrieval mechanism, blending both coarse-grain and fine-grain paradigms. Moreover, to enrich the contextual relationship within the query tuple, we integrate a generative cross-modal alignment technique, ensuring synchronization of sequential attributes between image and text data.
- Anthology ID:
- 2024.findings-acl.965
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2024
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 16315–16325
- URL:
- https://preview.aclanthology.org/add_missing_videos/2024.findings-acl.965/
- DOI:
- 10.18653/v1/2024.findings-acl.965
- Cite (ACL):
- Yuchen Yang, Yu Wang, and Yanfeng Wang. 2024. CF-TCIR: A Compositor-Free Framework for Hierarchical Text-Conditioned Image Retrieval. In Findings of the Association for Computational Linguistics: ACL 2024, pages 16315–16325, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- CF-TCIR: A Compositor-Free Framework for Hierarchical Text-Conditioned Image Retrieval (Yang et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2024.findings-acl.965.pdf