A Parameter-Efficient and Fine-Grained Prompt Learning for Vision-Language Models

Yongbin Guo, Shuzhen Li, Zhulin Liu, Tong Zhang, C.L.Philip Chen


Abstract
Current vision-language models (VLMs) understand complex vision-text tasks by extracting overall semantic information from large-scale cross-modal associations. However, extraction over large-scale cross-modal associations often smooths out semantic details and is computationally expensive, limiting both the performance and the efficiency of fine-grained multimodal understanding. To address this issue, this paper proposes a detail-oriented prompt learning (DoPL) method for vision-language models that implements fine-grained multimodal semantic alignment with merely 0.25M trainable parameters. Following the low-entropy information concentration theory, DoPL discovers shared interest tokens from text-vision correlations and transforms them into alignment weights that enhance the text prompt and the vision prompt via detail-oriented prompt generation. This effectively guides the current frozen layer to extract fine-grained text-vision alignment cues. Furthermore, DoPL applies detail-oriented prompt generation at each frozen layer to localize fine-grained semantic alignment layer by layer, achieving precise understanding in complex vision-text tasks. DoPL performs well in parameter-efficient fine-grained semantic alignment with only 0.12% tunable parameters for vision-language models. State-of-the-art results over previous parameter-efficient fine-tuning methods and full fine-tuning approaches on six benchmarks demonstrate the effectiveness and efficiency of DoPL in complex multimodal tasks.
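To make the mechanism described above concrete, here is a minimal, hypothetical sketch of one detail-oriented prompt generation step. It is not the authors' implementation: the function name, the `top_k` cutoff, and the choice of max-pooling the similarity map to score "shared interest" tokens are all illustrative assumptions. The sketch computes text-vision correlations, concentrates weight on a few strongly aligned tokens (in the spirit of low-entropy information concentration), and adds the resulting pooled cues to the learnable prompts fed into the next frozen layer.

```python
import numpy as np

def detail_oriented_prompts(text_tokens, vision_tokens,
                            text_prompt, vision_prompt, top_k=4):
    """Hypothetical sketch of one detail-oriented prompt generation step.

    text_tokens: (T, D), vision_tokens: (V, D) - frozen-layer token features.
    text_prompt, vision_prompt: (P, D) - learnable prompt tokens.
    """
    # Cosine-similarity map between every text token and every vision token.
    t = text_tokens / np.linalg.norm(text_tokens, axis=-1, keepdims=True)
    v = vision_tokens / np.linalg.norm(vision_tokens, axis=-1, keepdims=True)
    sim = t @ v.T                                        # (T, V)

    def interest_weights(scores, k):
        # Keep only the k most strongly cross-modally aligned tokens and
        # softmax their scores, concentrating weight on a few detail tokens.
        w = np.zeros_like(scores)
        idx = np.argsort(scores)[-min(k, scores.size):]
        e = np.exp(scores[idx] - scores[idx].max())
        w[idx] = e / e.sum()
        return w

    # Score each token by its best match in the other modality (assumption).
    text_w = interest_weights(sim.max(axis=1), top_k)    # (T,)
    vision_w = interest_weights(sim.max(axis=0), top_k)  # (V,)

    # Pool the weighted detail cues and add them to the learnable prompts
    # (broadcast over the P prompt tokens) for the current frozen layer.
    text_cue = text_w @ text_tokens                      # (D,)
    vision_cue = vision_w @ vision_tokens                # (D,)
    return text_prompt + text_cue, vision_prompt + vision_cue
```

In the paper's layer-by-layer scheme, a step like this would run once per frozen encoder layer, so only the small prompt tensors (and any generation parameters) are trainable while the backbone stays frozen.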
Anthology ID:
2025.acl-long.1514
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
31346–31359
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1514/
Cite (ACL):
Yongbin Guo, Shuzhen Li, Zhulin Liu, Tong Zhang, and C.L.Philip Chen. 2025. A Parameter-Efficient and Fine-Grained Prompt Learning for Vision-Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31346–31359, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
A Parameter-Efficient and Fine-Grained Prompt Learning for Vision-Language Models (Guo et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1514.pdf