CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling

Minghui Fang; Shengpeng Ji; Jialong Zuo; Hai Huang; Yan Xia; Jieming Zhu; Xize Cheng; Xiaoda Yang; Wenrui Liu; Gang Wang; Zhenhua Dong; Zhou Zhao

Abstract

Cross-modal retrieval aims to find instances that are semantically related to a query by modeling the interaction between data of different modalities. Traditional solutions use a single-tower or dual-tower framework to explicitly compute a score between each query and candidate, which incurs high training cost and inference latency on large-scale data. Inspired by the remarkable performance and efficiency of generative models, we propose CART, a generative cross-modal retrieval framework based on coarse-to-fine semantic modeling, which assigns an identifier to each candidate and treats generating that identifier as the retrieval target. Specifically, we explore an effective coarse-to-fine scheme that combines K-Means and RQ-VAE to discretize multimodal data into token sequences that support autoregressive generation. Further, to compensate for the lack of explicit interaction between queries and candidates, we propose a feature fusion strategy to align their semantics. Extensive experiments demonstrate the effectiveness of each strategy in CART, which achieves excellent results in both retrieval performance and efficiency.
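
The coarse-to-fine discretization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the learned RQ-VAE is replaced here by plain residual quantization with K-Means codebooks, and the function names (`kmeans`, `coarse_to_fine_ids`), the codebook sizes, the number of levels, and the choice to quantize the residual left after subtracting the coarse centroid are all assumptions made for illustration. The first token of each identifier is a coarse cluster index; the remaining tokens refine it level by level.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(0)
    # Final assignment against the last centroid update.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(1)


def coarse_to_fine_ids(x, coarse_k=4, fine_k=8, levels=2):
    """Map each embedding to a token sequence [coarse, fine_1, ..., fine_L].

    Assumption: fine tokens come from residual quantization with k-means
    codebooks, a stand-in for a trained RQ-VAE.
    """
    centroids, coarse = kmeans(x, coarse_k)
    residual = x - centroids[coarse]  # what the coarse token fails to explain
    tokens = [coarse]
    for level in range(levels):
        codebook, codes = kmeans(residual, fine_k, seed=level + 1)
        tokens.append(codes)
        residual = residual - codebook[codes]  # pass the remainder to the next level
    return np.stack(tokens, axis=1), residual


# Hypothetical candidate embeddings standing in for real multimodal features.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
ids, residual = coarse_to_fine_ids(embeddings)
print(ids[:3])
```

In a generative retrieval setup, token sequences of this kind would serve as the targets that an autoregressive model learns to emit for a given query; candidates sharing a coarse token fall into the same semantic cluster, and the fine tokens distinguish them within it.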

Anthology ID: 2025.acl-long.735
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 15120–15133
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.735/
Cite (ACL): Minghui Fang, Shengpeng Ji, Jialong Zuo, Hai Huang, Yan Xia, Jieming Zhu, Xize Cheng, Xiaoda Yang, Wenrui Liu, Gang Wang, Zhenhua Dong, and Zhou Zhao. 2025. CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15120–15133, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling (Fang et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.735.pdf
