Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang


Abstract
From a visual scene containing multiple people, humans can distinguish each individual given context descriptions about what happened before, their mental/physical states, their intentions, etc. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the “person who needs healing” in an image, we need to first know that such a person usually has injuries or a suffering expression, then find the corresponding visual clues, and finally ground the person. We present a new commonsense task, Human-centric Commonsense Grounding, which tests a model’s ability to ground individuals given context descriptions about what happened before and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pre-trained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future work. Data and code will be available at https://github.com/Hxyou/HumanCog.
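To make the task concrete, the following is a minimal Python sketch of what a grounding instance and its evaluation might look like. The field names and schema here are illustrative assumptions, not the actual HumanCog format:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical task instance for Human-centric Commonsense Grounding.
# Field names are illustrative; this is NOT the actual HumanCog schema.
@dataclass
class GroundingInstance:
    image_path: str                                 # scene containing multiple people
    description: str                                # e.g., "person who needs healing"
    person_boxes: List[Tuple[int, int, int, int]]   # candidate person boxes (x1, y1, x2, y2)
    target_index: int                               # index of the person the description refers to

def grounding_accuracy(predictions: List[int],
                       instances: List[GroundingInstance]) -> float:
    """Fraction of instances where the model selects the correct person."""
    correct = sum(int(pred == inst.target_index)
                  for pred, inst in zip(predictions, instances))
    return correct / len(instances)

# A model would score each candidate box in person_boxes against the
# description, predict the argmax index, and be scored with
# grounding_accuracy over the test set.
```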
Anthology ID:
2022.findings-emnlp.399
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5444–5454
URL:
https://aclanthology.org/2022.findings-emnlp.399
DOI:
10.18653/v1/2022.findings-emnlp.399
Cite (ACL):
Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, and Shih-Fu Chang. 2022. Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5444–5454, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding (You et al., Findings 2022)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2022.findings-emnlp.399.pdf