Cross-modal Clustering-based Retrieval for Scalable and Robust Image Captioning

Jingyi You, Hiroshi Sasaki, Kazuma Kadowaki


Abstract
Recent advances in retrieval-augmented generative image captioning (RAG-IC) have significantly improved caption quality by incorporating external knowledge and similar examples into language model-driven caption generators. However, these methods still face challenges in real-world scenarios. First, many existing approaches rely on bimodal retrieval datastores that require large amounts of labeled data and substantial manual effort to construct, making them costly and time-consuming. Moreover, they simply retrieve the samples nearest to the input query, which yields highly redundant retrieved content and in turn degrades the quality of the generated captions. In this paper, we introduce a novel RAG-IC approach, the Cross-modal Diversity-promoting Retrieval technique (CoDiRet), which integrates a text-only unimodal retrieval module with our unique cluster-based retrieval mechanism. Our approach simultaneously enhances the scalability of the datastore, promotes diversity in the retrieved content, and improves robustness to out-of-domain inputs, facilitating real-world application. Experimental results demonstrate that our method, despite being trained exclusively on the COCO benchmark dataset, achieves competitive performance on the in-domain benchmark and generalizes robustly across different domains without additional training.
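
The paper's implementation is not included on this page, but the abstract's core idea can be sketched: instead of taking the k nearest neighbors (which tend to be near-duplicates), cluster the text-only datastore offline and retrieve one representative caption from each of the k clusters closest to the query. Below is a minimal, hypothetical Python sketch of this general cluster-based diverse retrieval idea, assuming caption embeddings and image queries live in a shared CLIP-style space; the function names, the choice of k-means, and all parameters are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of cluster-based diverse retrieval (NOT the paper's code).
# Assumes caption and query embeddings are L2-normalized vectors in a shared
# CLIP-style space, so dot product equals cosine similarity.
import numpy as np
from sklearn.cluster import KMeans

def build_clustered_datastore(caption_embs: np.ndarray, n_clusters: int = 64) -> KMeans:
    """Cluster the text-only caption embeddings once, offline."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(caption_embs)

def retrieve_diverse(query_emb: np.ndarray, caption_embs: np.ndarray,
                     km: KMeans, k: int = 5) -> list[int]:
    """Pick the k clusters whose centroids are most similar to the query,
    then the single nearest caption within each cluster, so the retrieved
    set spans distinct regions rather than k near-duplicates."""
    # Rank clusters by centroid similarity to the query.
    centroid_sims = km.cluster_centers_ @ query_emb
    top_clusters = np.argsort(-centroid_sims)[:k]
    picks = []
    for c in top_clusters:
        members = np.where(km.labels_ == c)[0]       # indices in this cluster
        sims = caption_embs[members] @ query_emb     # similarity within cluster
        picks.append(int(members[np.argmax(sims)]))  # best caption per cluster
    return picks
```

A plain top-k search would instead return the k globally highest `sims`, which for a dense datastore are often paraphrases of one another; selecting one item per cluster is one simple way to trade a little similarity for diversity in the retrieved context.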
Anthology ID:
2025.magmar-1.4
Volume:
Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Reno Kriz, Kenton Murray
Venues:
MAGMaR | WS
Publisher:
Association for Computational Linguistics
Pages:
47–58
URL:
https://preview.aclanthology.org/landing_page/2025.magmar-1.4/
Cite (ACL):
Jingyi You, Hiroshi Sasaki, and Kazuma Kadowaki. 2025. Cross-modal Clustering-based Retrieval for Scalable and Robust Image Captioning. In Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), pages 47–58, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Cross-modal Clustering-based Retrieval for Scalable and Robust Image Captioning (You et al., MAGMaR 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.magmar-1.4.pdf