@inproceedings{kronberger-ventura-2024-thavqa,
    title = "{THAVQA}: A {G}erman Task-oriented {VQA} Dataset Annotated with Human Visual Attention",
    author = "Kronberger, Moritz  and
      Ventura, Viviana",
    editor = "Dell'Orletta, Felice  and
      Lenci, Alessandro  and
      Montemagni, Simonetta  and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://preview.aclanthology.org/ingest-emnlp/2024.clicit-1.55/",
    pages = "459--469",
    ISBN = "979-12-210-7060-6",
    abstract = "Video question answering (VQA) is a challenging task that requires models to generate answers by using both information from text and video. We present Task-oriented Human Attention Video Question Answering (THAVQA), a new VQA dataset consisting of third- and first-person videos of an instructor using a sewing machine. The sewing task is formalized step-by-step in a script: each step consists of a video annotated with German-language open-ended question and answer (QA) pairs and with human visual attention. The paper also includes a first assessment of the performance of a pre-trained Multimodal Large Language Model (MLLM) in generating answers to the questions of our dataset across different experimental settings. Results show that our task-oriented dataset is challenging for pre-trained models. Specifically, the model struggles to answer questions requiring technical knowledge or spatio-temporal reasoning."
}