Sanglu Lu
2025
PVTNL: Prompting Vision Transformers with Natural Language for Generalizable Person Re-identification
Wangning
|
Lei Xie
|
Sanglu Lu
|
Shiwei Gan
Findings of the Association for Computational Linguistics: EMNLP 2025
Domain generalization person re-identification (DG-ReID) aims to train models on source domains and generalize to unseen target domains.While patch-based Vision Transformers have achieved success in capturing fine-grained visual features, they often overlook global semantic structure and suffer from feature entanglement, leading to overfitting across domains. Meanwhile, natural language provides high-level semantic abstraction but lacks spatial precision for fine-grained alignment.We propose PVTNL (Prompting Vision Transformers with Natural Language), a novel framework for generalizable person re-identification. PVTNL leverages the pre-trained vision-language model BLIP to extract aligned visual and textual embeddings. Specifically, we utilize body-part cues to segment images into semantically coherent regions and align them with corresponding natural language descriptions. These region-level textual prompts are encoded and injected as soft prompts into the Vision Transformer to guide localized feature learning. Notably, our language module is retained during inference, enabling persistent semantic grounding that enhances cross-domain generalization.Extensive experiments on standard DG-ReID benchmarks demonstrate that PVTNL achieves state-of-the-art performance. Ablation studies further confirm the effectiveness of body-part-level alignment, soft language prompting, and the benefit of preserving language guidance at inference time.