Bridging Semantic and Modality Gaps in Zero-Shot Captioning via Retrieval from Synthetic Data

Zhiyue Liu, Wenkai Zhou


Abstract
Zero-shot image captioning, which aims to generate image descriptions without relying on annotated data, has recently attracted increasing research interest. Pre-trained text-to-image generation models make it possible to build synthetic image-text pairs from text data alone, but existing methods fall short in mitigating the discrepancy that arises because synthetic images cannot fully capture the semantics of their textual input, which leads to unreliable cross-modal correspondences. To address this, we propose a retrieval-based framework that uses only existing synthetic image-text pairs as its search corpus to systematically bridge the gaps that arise when captioning with synthetic data. To narrow the semantic gap between a synthetic image and its input text, our framework retrieves supplementary visual features from similar synthetic examples and integrates them to refine the image embedding. It then extracts image-related textual descriptions to mitigate the modality gap during decoding. Moreover, we introduce a plug-and-play visual semantic module that detects visual entities, further facilitating the construction of semantic correspondences between images and text. Experimental results on benchmark datasets demonstrate that our method achieves state-of-the-art results.
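
The retrieval-and-refinement idea described in the abstract can be illustrated with a minimal sketch, assuming a CLIP-style embedding space and a toy synthetic corpus of (image embedding, caption) pairs. The corpus, the neighbour count k, the fusion weight alpha, and the helpers l2_normalize and retrieve_and_refine are illustrative placeholders, not the authors' implementation: nearest synthetic pairs are retrieved, their visual features are fused into the query embedding (semantic gap), and their captions are returned as textual context for decoding (modality gap).

import numpy as np

# Hypothetical synthetic corpus of (image embedding, caption) pairs.
# In practice these would come from a text-to-image model plus a
# CLIP-style encoder; random vectors stand in as placeholders here.
rng = np.random.default_rng(0)
corpus_img = rng.standard_normal((1000, 512)).astype(np.float32)
corpus_txt = [f"synthetic caption {i}" for i in range(1000)]

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

corpus_img = l2_normalize(corpus_img)

def retrieve_and_refine(query_emb, k=5, alpha=0.7):
    """Refine a query image embedding with its k nearest synthetic
    neighbours and return the neighbours' captions as textual context."""
    q = l2_normalize(query_emb)
    sims = corpus_img @ q                      # cosine similarities to corpus
    top_k = np.argsort(-sims)[:k]              # indices of nearest synthetic pairs
    neighbour_mean = l2_normalize(corpus_img[top_k].mean(axis=0))
    refined = l2_normalize(alpha * q + (1 - alpha) * neighbour_mean)
    context_captions = [corpus_txt[i] for i in top_k]
    return refined, context_captions

query = rng.standard_normal(512).astype(np.float32)
refined_emb, captions = retrieve_and_refine(query)
print(refined_emb.shape, captions[:2])

In the full framework the refined embedding would condition the caption decoder and the retrieved captions would serve as text-side prompts; this sketch only shows the retrieval and fusion step.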
Anthology ID:
2025.findings-emnlp.754
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14010–14023
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.754/
DOI:
10.18653/v1/2025.findings-emnlp.754
Cite (ACL):
Zhiyue Liu and Wenkai Zhou. 2025. Bridging Semantic and Modality Gaps in Zero-Shot Captioning via Retrieval from Synthetic Data. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14010–14023, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Bridging Semantic and Modality Gaps in Zero-Shot Captioning via Retrieval from Synthetic Data (Liu & Zhou, Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.754.pdf
Checklist:
 2025.findings-emnlp.754.checklist.pdf