Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, Qichuan Ding


Abstract
Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks. The data and pre-trained models are released at https://github.com/Multimodal-Representation-Learning-MRL/GA-DMS.
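Illustrative sketch (not the authors' released implementation): the abstract describes scoring text tokens with a gradient-attention similarity score and masking the noisy, low-scoring ones. The minimal PyTorch snippet below shows one way such a score could be formed; the pooling rule, the elementwise fusion of gradient magnitude with attention weights, the 30% mask ratio, and all tensor shapes are assumptions for illustration only.

# Sketch: score text tokens by |d(similarity)/d(token)| * attention weight,
# then mask the lowest-scoring (presumed noisy) tokens.
import torch
import torch.nn.functional as F

def gradient_attention_token_scores(text_feats, img_feat, attn_weights):
    # text_feats:   (L, D) token embeddings, requires_grad=True
    # img_feat:     (D,)   global image embedding
    # attn_weights: (L,)   attention from the text [EOS]/[CLS] token to each token
    pooled = text_feats.mean(dim=0)                     # simple mean pooling (assumption)
    sim = F.cosine_similarity(pooled, img_feat, dim=0)  # global image-text similarity
    grads, = torch.autograd.grad(sim, text_feats)       # gradient of similarity w.r.t. tokens
    grad_score = grads.norm(dim=-1)                     # per-token gradient magnitude
    return grad_score * attn_weights                    # fuse gradient and attention cues

def mask_noisy_tokens(scores, mask_ratio=0.3):
    # Boolean mask that drops the lowest-scoring (least informative) tokens.
    k = max(1, int(mask_ratio * scores.numel()))
    keep = torch.ones_like(scores, dtype=torch.bool)
    keep[scores.topk(k, largest=False).indices] = False
    return keep

# Toy usage with random features
L, D = 16, 512
text_feats = torch.randn(L, D, requires_grad=True)
img_feat = torch.randn(D)
attn = torch.softmax(torch.randn(L), dim=0)
scores = gradient_attention_token_scores(text_feats, img_feat, attn)
print(mask_noisy_tokens(scores))

In the paper's framework, a complementary masked token prediction objective asks the model to recover the informative tokens; the sketch above only covers the noisy-token masking side.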
Anthology ID:
2025.emnlp-main.14
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
259–271
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.14/
Cite (ACL):
Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, and Qichuan Ding. 2025. Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 259–271, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval (Zheng et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.14.pdf
Checklist:
2025.emnlp-main.14.checklist.pdf