Walk in Others’ Shoes with a Single Glance: Human-Centric Visual Grounding with Top-View Perspective Transformation

Yuqi Bu, Xin Wu, Zirui Zhao, Yi Cai, David Hsu, Qiong Liu


Abstract
Visual perspective-taking, the ability to envision others’ perspectives from a single self-perspective, is vital in human-robot interactions. We therefore introduce a human-centric visual grounding task and a dataset to evaluate this ability. Recent vision-language models (VLMs) have shown potential for inferring others’ perspectives, yet they are insensitive to the information differences induced by slight perspective changes. To address this problem, we propose a top-view enhanced perspective transformation (TEP) method, which decomposes the transition from the robot’s perspective to the human’s through an abstract top-view representation. It unifies perspectives and facilitates the capture of information differences across diverse perspectives. Experimental results show that TEP improves performance by up to 18%, exhibits perspective-taking abilities across various perspectives, and generalizes effectively to robotic and dynamic scenarios.
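The idea of routing a perspective change through a shared top view can be illustrated with a small geometric sketch. This is not the TEP method from the paper, whose details are not given in the abstract; it only shows, under assumed 2D poses, why the same object can sit on the robot's right and the human's left. The names `Pose2D`, `to_agent_frame`, and `side_of`, and the example coordinates, are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose2D:
    """Assumed agent pose in a shared top-view (bird's-eye) frame."""
    x: float
    y: float
    heading: float  # radians; direction the agent faces in the top-view frame


def to_agent_frame(point, pose):
    """Express a top-view point as (forward, left) offsets relative to an agent."""
    dx = point[0] - pose.x
    dy = point[1] - pose.y
    cos_h, sin_h = math.cos(pose.heading), math.sin(pose.heading)
    forward = dx * cos_h + dy * sin_h   # projection onto the facing direction
    left = -dx * sin_h + dy * cos_h     # projection onto the agent's left axis
    return forward, left


def side_of(point, pose, eps=1e-9):
    """Return 'left', 'right', or 'center' for a point as seen by the agent."""
    _, left = to_agent_frame(point, pose)
    if left > eps:
        return "left"
    if left < -eps:
        return "right"
    return "center"


# Hypothetical scene: robot and human face each other, with a cup between them.
robot = Pose2D(x=0.0, y=0.0, heading=math.pi / 2)    # faces +y
human = Pose2D(x=0.0, y=4.0, heading=-math.pi / 2)   # faces -y
cup = (1.0, 2.0)

print("robot sees cup on its", side_of(cup, robot))
print("human sees cup on their", side_of(cup, human))
```

Because both agents are described in one top-view frame, switching perspective reduces to a translation and a rotation, which is the kind of unification the abstract attributes to the top-view representation.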
Anthology ID:
2025.acl-long.1306
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
26904–26923
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1306/
Cite (ACL):
Yuqi Bu, Xin Wu, Zirui Zhao, Yi Cai, David Hsu, and Qiong Liu. 2025. Walk in Others’ Shoes with a Single Glance: Human-Centric Visual Grounding with Top-View Perspective Transformation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26904–26923, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Walk in Others’ Shoes with a Single Glance: Human-Centric Visual Grounding with Top-View Perspective Transformation (Bu et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1306.pdf