DRUM: Learning Demonstration Retriever for Large MUlti-modal Models

Ellen Yi-Ge, Jiechao Gao, Wei Han, Wei Zhu


Abstract
Recently, large language models (LLMs) have demonstrated impressive capabilities in dealing with new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopt naive strategies, such as fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations meet the needs of the LVLMs. To address this issue, we propose a novel framework, demonstration retriever for large multi-modal models (DRUM), which fine-tunes the CLIP embedding model to better meet the LVLM's needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given, and propose concatenating the image and text embeddings to enhance retrieval performance. Second, we propose to re-rank the demonstrations retrieved by the embedding model via the LVLM's feedback, and compute a list-wise ranking loss to train the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks and 7 benchmark datasets, our DRUM framework is shown to be effective in boosting the LVLM's in-context learning performance by retrieving more suitable demonstrations.
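To make the two core ideas in the abstract concrete, below is a minimal sketch (not the authors' released code) of (1) retrieving demonstrations with concatenated image and text embeddings and (2) a list-wise ranking loss that aligns the retriever's scores with the ordering implied by the LVLM's feedback. The embedding dimensions, the candidate pool, and the feedback scores are illustrative assumptions.

```python
# Illustrative sketch of concatenated-embedding retrieval and a list-wise
# ranking loss driven by LVLM feedback; all shapes and scores are assumptions.
import torch
import torch.nn.functional as F


def joint_embedding(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate L2-normalized image and text embeddings into one query vector."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return torch.cat([img_emb, txt_emb], dim=-1)


def retrieve_topk(query: torch.Tensor, pool: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Return indices of the k pool demonstrations most similar to the query."""
    sims = F.normalize(query, dim=-1) @ F.normalize(pool, dim=-1).T  # (1, N)
    return sims.topk(k, dim=-1).indices.squeeze(0)


def listwise_ranking_loss(retriever_scores: torch.Tensor,
                          lvlm_scores: torch.Tensor) -> torch.Tensor:
    """Match the retriever's score distribution over the candidate list to the
    distribution induced by the LVLM's feedback (e.g., per-demonstration
    likelihood of the gold answer)."""
    target = F.softmax(lvlm_scores, dim=-1)
    log_pred = F.log_softmax(retriever_scores, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")


if __name__ == "__main__":
    d = 512                                     # assumed per-modality embedding size
    query = joint_embedding(torch.randn(1, d), torch.randn(1, d))
    pool = torch.randn(100, 2 * d)              # pre-computed candidate embeddings
    top_ids = retrieve_topk(query, pool, k=8)

    # Placeholder LVLM feedback scores for the 8 retrieved demonstrations;
    # in DRUM these would come from the LVLM itself.
    retriever_scores = (F.normalize(query, dim=-1) @
                        F.normalize(pool[top_ids], dim=-1).T)
    lvlm_scores = torch.randn(1, 8)
    loss = listwise_ranking_loss(retriever_scores, lvlm_scores)
    print(top_ids.tolist(), float(loss))
```

In this reading, the KL-style list-wise objective stands in for whatever list-wise loss the paper uses; the key point is that the retriever is trained against an ordering supplied by the LVLM rather than by a fixed similarity metric alone.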
Anthology ID:
2025.acl-srw.83
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Jin Zhao, Mingyang Wang, Zhu Liu
Venues:
ACL | WS
Publisher:
Association for Computational Linguistics
Pages:
1051–1063
URL:
https://preview.aclanthology.org/landing_page/2025.acl-srw.83/
Cite (ACL):
Ellen Yi-Ge, Jiechao Gao, Wei Han, and Wei Zhu. 2025. DRUM: Learning Demonstration Retriever for Large MUlti-modal Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 1051–1063, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
DRUM: Learning Demonstration Retriever for Large MUlti-modal Models (Yi-Ge et al., ACL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.acl-srw.83.pdf