Guijin Luo
2026
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search
Zequn Xie | Guijin Luo | Chuxin Wang | Sihang Cai | Tao Jin | Zhou Zhao | Yixuan Tang
Findings of the Association for Computational Linguistics: ACL 2026
Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
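The two-stage cascade described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine-similarity coarse stage, the weighted-sum fusion, and the `alpha` parameter are all assumptions made for illustration, and `semantic_score` is a hypothetical stand-in for the output of the multi-agent verification module.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_retrieve(query_vec, gallery, k):
    """Stage 1 (assumed form): keep the k gallery items most similar to the query."""
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def rerank(candidates, semantic_score, alpha=0.5):
    """Stage 2 (assumed form): fuse the structural score with a semantic score.

    alpha weights the structural prior; (1 - alpha) weights the semantic score.
    """
    fused = [
        (idx, alpha * structural + (1 - alpha) * semantic_score(idx))
        for idx, structural in candidates
    ]
    fused.sort(key=lambda pair: pair[1], reverse=True)
    return fused

# Toy data: "a" and "b" have similar structure, "c" does not.
gallery = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
# Hypothetical semantic scores, standing in for an expensive verifier.
semantic = {"a": 0.1, "b": 0.9}

shortlist = coarse_retrieve([1.0, 0.0], gallery, k=2)
final = rerank(shortlist, semantic.get, alpha=0.5)
```

In this toy run the cheap first stage discards "c", so the expensive semantic scorer is only consulted for the two surviving candidates. That is the efficiency argument the abstract makes for a cascade.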
Iterative Self-Correction for Text-Driven Person Re-Identification with Large Vision-Language Models
Guijin Luo | Zequn Xie | Sihang Cai | Chuxin Wang | Zhou Zhao | Yixuan Tang
Findings of the Association for Computational Linguistics: ACL 2026
Person Re-Identification (ReID) has long struggled with the semantic gap between low-level visual features and high-level identity concepts. While Vision-Language Models (VLMs) offer promising semantic understanding, existing methods typically adopt a static "one-pass" paradigm, converting images to text once for retrieval. This approach suffers from two critical flaws: an Information Bottleneck, where converting rich visuals into text loses detail, and Open-Loop Failure, where initial hallucinations propagate without recourse. To address these flaws, we propose Auto-ReID, a novel framework that reformulates ReID as an iterative "Think-and-Refine" process. We first introduce a Hierarchical Progressive Tuning strategy to transform a generic VLM into a specialized ReID expert. During inference, we deploy a closed-loop architecture comprising a Reasoner for structured attribute extraction, a Hybrid Retriever that anchors dynamic semantic queries with stable visual features to prevent drift, and a Corrector that deconstructs and verifies candidates to iteratively optimize the search. Extensive experiments on ReID datasets demonstrate that our method significantly outperforms state-of-the-art approaches, particularly in complex occlusion scenarios.
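The closed-loop idea in this abstract can be sketched as a simple control loop. This is an illustrative assumption, not the Auto-ReID implementation: `think_and_refine`, its callable arguments, the `max_rounds` cap, and the toy attribute sets below are all hypothetical, and the toy retriever omits the visual anchoring the abstract describes.

```python
def think_and_refine(query, reason, retrieve, verify, max_rounds=3):
    """Run reason -> retrieve -> verify until a candidate is accepted.

    Returns the last candidate and the number of rounds used.
    """
    attributes = reason(query, None)
    candidate = None
    for round_idx in range(max_rounds):
        candidate = retrieve(attributes)
        accepted, feedback = verify(candidate)
        if accepted:
            return candidate, round_idx + 1
        # Feed the verifier's objection back into the next reasoning step.
        attributes = reason(query, feedback)
    return candidate, max_rounds

# Toy gallery: each identity is described by a set of attributes.
gallery = {"p1": {"red jacket"}, "p2": {"red jacket", "backpack"}}

def reason(query, feedback):
    """Hypothetical reasoner: starts coarse, adds the attribute named in feedback."""
    attrs = {"red jacket"}
    if feedback is not None:
        attrs.add(feedback)
    return attrs

def retrieve(attrs):
    """Hypothetical retriever: first identity (by id) whose attributes cover attrs."""
    for pid in sorted(gallery):
        if attrs <= gallery[pid]:
            return pid
    return None

def verify(candidate):
    """Hypothetical corrector: rejects candidates without a backpack."""
    if candidate is not None and "backpack" in gallery[candidate]:
        return True, None
    return False, "backpack"

result = think_and_refine("person in a red jacket with a backpack",
                          reason, retrieve, verify)
```

A one-pass pipeline would stop at the first candidate, "p1". The loop gives the verifier a chance to correct the query before the result is returned.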