THAVQA: A German Task-oriented VQA Dataset Annotated with Human Visual Attention

Moritz Kronberger; Viviana Ventura

Abstract

Video question answering (VQA) is a challenging task that requires models to generate answers using information from both text and video. We present Task-oriented Human Attention Video Question Answering (THAVQA), a new VQA dataset consisting of third- and first-person videos of an instructor using a sewing machine. The sewing task is formalized step by step in a script: each step consists of a video annotated with German-language open-ended question and answer (QA) pairs and with human visual attention. The paper also includes a first assessment of how well a pre-trained Multimodal Large Language Model (MLLM) generates answers to the questions in our dataset across different experimental settings. Results show that our task-oriented dataset is challenging for pre-trained models. Specifically, the model struggles to answer questions requiring technical knowledge or spatio-temporal reasoning.

Anthology ID: 2024.clicit-1.55
Volume: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)
Month: December
Year: 2024
Address: Pisa, Italy
Editors: Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue: CLiC-it
Publisher: CEUR Workshop Proceedings
Pages: 459–469
URL: https://preview.aclanthology.org/ingest-emnlp/2024.clicit-1.55/
Cite (ACL): Moritz Kronberger and Viviana Ventura. 2024. THAVQA: A German Task-oriented VQA Dataset Annotated with Human Visual Attention. In Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), pages 459–469, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal): THAVQA: A German Task-oriented VQA Dataset Annotated with Human Visual Attention (Kronberger & Ventura, CLiC-it 2024)
PDF: https://preview.aclanthology.org/ingest-emnlp/2024.clicit-1.55.pdf
