Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo


Abstract
Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer’s head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained. We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as resolution and object-naming skills, and identified the main reason for the gap as VLMs inferring gaze direction using head orientation rather than eye appearance. Such a bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision model. Future work should investigate whether these findings hold broadly across various deep learning methods trained on existing data, and whether better data mitigates this problem for all architectures. Pinpointing the reason sets the stage for technologies that can interpret gaze targets to have more efficient interactions with humans.
Anthology ID:
2026.findings-acl.504
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10370–10389
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.504/
Cite (ACL):
Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, and Dezhi Luo. 2026. Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10370–10389, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues (Zhang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.504.pdf
Checklist:
 2026.findings-acl.504.checklist.pdf