GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations

Muhammet Ilaslan; Chenan Song; Joya Chen; Difei Gao; Weixian Lei; Qianli Xu; Joo Lim; Mike Shou

doi:10.18653/v1/2023.emnlp-main.648

Abstract

The usage of exocentric and egocentric videos in Video Question Answering (VQA) is a new endeavor in human-robot interaction and collaboration studies. Particularly for egocentric videos, one may leverage eye-gaze information to understand human intentions during the task. In this paper, we build a novel task-oriented VQA dataset, called GazeVQA, for collaborative tasks where gaze information is captured during the task process. GazeVQA is designed with a novel QA format that covers thirteen different reasoning types to capture multiple aspects of task information and user intent. For each participant, GazeVQA consists of more than 1,100 textual questions and more than 500 labeled images that were annotated with the assistance of the Segment Anything Model. In total, 2,967 video clips, 12,491 labeled images, and 25,040 questions from 22 participants were included in the dataset. Additionally, inspired by the assisting models and common ground theory for industrial task collaboration, we propose a new AI model called AssistGaze that is designed to answer the questions with three different answer types, namely textual, image, and video. AssistGaze can effectively ground the perceptual input into semantic information while reducing ambiguities. We conduct comprehensive experiments to demonstrate the challenges of GazeVQA and the effectiveness of AssistGaze.

Anthology ID: 2023.emnlp-main.648
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 10462–10479
URL: https://aclanthology.org/2023.emnlp-main.648
DOI: 10.18653/v1/2023.emnlp-main.648
Cite (ACL): Muhammet Ilaslan, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Lim, and Mike Shou. 2023. GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10462–10479, Singapore. Association for Computational Linguistics.
Cite (Informal): GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations (Ilaslan et al., EMNLP 2023)
PDF: https://preview.aclanthology.org/improve-issue-templates/2023.emnlp-main.648.pdf
Video: https://preview.aclanthology.org/improve-issue-templates/2023.emnlp-main.648.mp4
