Viviana Ventura
Charts convey information holistically through visual and textual features, which makes downstream tasks such as chart question answering challenging, since both kinds of information contribute to the task. The standard approach decouples the task into two steps: first extracting information from the chart, or representing it as a table, text, or code, and then applying a reasoning step to produce the answer. Advances in the visual encoding of Visual Large Language Models (VLLMs) have shown that such complex tasks can be solved without intermediate representations of the charts or massive in-domain training. Our new instruction fine-tuned, chain-of-thought model QwenChart shows that, even on a complex new benchmark such as SciVQA, general models can achieve strong performance with low-cost training, matching the capabilities that LLMs have shown on unimodal downstream tasks. An out-of-domain evaluation yielded satisfactory results, albeit with an expected drop in performance.
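As a rough illustration of the contrast described in this abstract, the sketch below compares the decoupled two-step pipeline with direct chain-of-thought prompting of a vision-language model. This is a minimal sketch, not the paper's code: extract_table, llm_answer, and vllm_generate are hypothetical placeholders standing in for a chart-to-table extractor, a text-only LLM call, and a VLLM inference call.

    # Minimal sketch (hypothetical helpers, not QwenChart's implementation).
    COT_PROMPT = (
        "You are answering a question about a scientific chart.\n"
        "Think step by step: describe the relevant axes, series, and values,\n"
        "then give the final answer on the last line prefixed with 'Answer:'.\n"
        "Question: {question}"
    )

    def decoupled_chart_qa(chart_image, question):
        """Two-step baseline: chart -> intermediate table -> text-only reasoning."""
        table = extract_table(chart_image)  # step 1: chart-to-table extraction (hypothetical)
        return llm_answer(f"Table:\n{table}\nQuestion: {question}")  # step 2: reasoning

    def direct_chart_qa(chart_image, question):
        """End-to-end: the VLLM reads the chart pixels and reasons in one pass."""
        response = vllm_generate(image=chart_image,
                                 prompt=COT_PROMPT.format(question=question))
        # Keep only the final line of the chain of thought as the answer.
        return response.splitlines()[-1].removeprefix("Answer:").strip()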
Video question answering (VQA) is a challenging task that requires models to generate answers using information from both text and video. We present Task-oriented Human Attention Video Question Answering (THAVQA), a new VQA dataset consisting of third- and first-person videos of an instructor using a sewing machine. The sewing task is formalized step by step in a script: each step consists of a video annotated with German-language open-ended question-answer (QA) pairs and with human visual attention. The paper also includes a first assessment of a pre-trained Multimodal Large Language Model (MLLM) in generating answers to the questions of our dataset across different experimental settings. Results show that our task-oriented dataset is challenging for pre-trained models: in particular, the model struggles with questions requiring technical knowledge or spatio-temporal reasoning.
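To make the evaluation setting concrete, the sketch below shows one way to query a pre-trained multimodal LLM with a step video and an open-ended question. It is a minimal sketch under stated assumptions, not the dataset's official pipeline: frame sampling uses OpenCV, and mllm_answer is a hypothetical wrapper around whichever MLLM inference API is actually used.

    # Minimal sketch; mllm_answer and the prompt format are assumptions, not THAVQA code.
    import cv2

    def sample_frames(video_path, num_frames=8):
        """Uniformly sample frames from a step video as the model's visual input."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for i in range(num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return frames

    def answer_step_question(video_path, question):
        """Zero-shot setting: sampled frames plus a German open-ended question."""
        frames = sample_frames(video_path)
        prompt = f"Beantworte die Frage zum gezeigten Arbeitsschritt.\nFrage: {question}"
        return mllm_answer(frames, prompt)  # hypothetical MLLM call; model not specified here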