PVTNL: Prompting Vision Transformers with Natural Language for Generalizable Person Re-identification

Wangning, Lei Xie, Sanglu Lu, Shiwei Gan


Abstract
Domain generalization person re-identification (DG-ReID) aims to train models on source domains and generalize to unseen target domains. While patch-based Vision Transformers have achieved success in capturing fine-grained visual features, they often overlook global semantic structure and suffer from feature entanglement, leading to overfitting across domains. Meanwhile, natural language provides high-level semantic abstraction but lacks spatial precision for fine-grained alignment. We propose PVTNL (Prompting Vision Transformers with Natural Language), a novel framework for generalizable person re-identification. PVTNL leverages the pre-trained vision-language model BLIP to extract aligned visual and textual embeddings. Specifically, we utilize body-part cues to segment images into semantically coherent regions and align them with corresponding natural language descriptions. These region-level textual prompts are encoded and injected as soft prompts into the Vision Transformer to guide localized feature learning. Notably, our language module is retained during inference, enabling persistent semantic grounding that enhances cross-domain generalization. Extensive experiments on standard DG-ReID benchmarks demonstrate that PVTNL achieves state-of-the-art performance. Ablation studies further confirm the effectiveness of body-part-level alignment, soft language prompting, and the benefit of preserving language guidance at inference time.
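To make the soft-prompting idea from the abstract concrete, the following is a minimal PyTorch sketch of how region-level text embeddings (e.g., features of body-part captions from a frozen text encoder such as BLIP's) could be projected into the visual token space and prepended to Vision Transformer patch tokens. All module names, dimensions, and the number of body-part regions are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch: region-level text features injected as soft prompts
# into a ViT-style encoder. Shapes and names are assumptions for illustration.
import torch
import torch.nn as nn

class SoftPromptedViT(nn.Module):
    def __init__(self, dim=768, num_layers=4, text_dim=512):
        super().__init__()
        # Project frozen text-encoder outputs (per-body-part captions)
        # into the visual token space so they act as soft prompts.
        self.text_proj = nn.Linear(text_dim, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patch_tokens, region_text_emb):
        # patch_tokens:    (B, N_patches, dim)       visual patch embeddings
        # region_text_emb: (B, num_regions, text_dim) per-body-part text features
        B = patch_tokens.size(0)
        prompts = self.text_proj(region_text_emb)      # (B, R, dim)
        cls = self.cls_token.expand(B, -1, -1)         # (B, 1, dim)
        tokens = torch.cat([cls, prompts, patch_tokens], dim=1)
        out = self.encoder(tokens)
        return out[:, 0]                               # CLS token as ID feature

# Toy usage with random tensors standing in for BLIP visual/text outputs.
model = SoftPromptedViT()
patches = torch.randn(2, 196, 768)      # 14x14 patch grid
text = torch.randn(2, 4, 512)           # 4 assumed body-part descriptions
feat = model(patches, text)             # (2, 768) identity embedding
print(feat.shape)

Because the prompt tokens are produced from natural-language descriptions rather than learned free parameters, the same mechanism can keep the language module active at inference time, which is the property the abstract highlights for cross-domain generalization.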
Anthology ID:
2025.findings-emnlp.1181
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
21663–21674
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1181/
DOI:
10.18653/v1/2025.findings-emnlp.1181
Cite (ACL):
Wangning, Lei Xie, Sanglu Lu, and Shiwei Gan. 2025. PVTNL: Prompting Vision Transformers with Natural Language for Generalizable Person Re-identification. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21663–21674, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
PVTNL: Prompting Vision Transformers with Natural Language for Generalizable Person Re-identification (Wangning et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1181.pdf
Checklist:
 2025.findings-emnlp.1181.checklist.pdf