IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning

Soeun Lee; Si-Woo Kim; Taewhan Kim; Dong-Jin Kim

doi:10.18653/v1/2024.emnlp-main.1153

Abstract

Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a fusion module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap (**I**mage-like Retrieval and **F**requency-based Entity Filtering for Zero-shot **Cap**tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming state-of-the-art zero-shot captioning methods based on text-only training by a significant margin in both image captioning and video captioning.

Anthology ID: 2024.emnlp-main.1153
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 20715–20727
URL: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.1153/
DOI: 10.18653/v1/2024.emnlp-main.1153
Cite (ACL): Soeun Lee, Si-Woo Kim, Taewhan Kim, and Dong-Jin Kim. 2024. IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20715–20727, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning (Lee et al., EMNLP 2024)
PDF: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.1153.pdf
