Viviana Ventura
2025
Instruction-tuned QwenChart for Chart Question Answering
Viviana Ventura
|
Lukas Amadeus Kleybolte
|
Alessandra Zarcone
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Charts, where information is delivered holistically by visual and textual features, represent a challenge when it comes to downstream tasks such as chart question answering, where both kinds of information contribute to the task. The standard approach is to decouple the task into two steps: first extracting information from the charts, or representing it as a table, text or code, and then a second reasoning step to output the answers. Today, advances in the visual encoding of Visual Large Language Models (VLLMs) have shown their capability to solve such complex tasks without intermediate representations of the charts or massive in-domain training. Our new instruction fine-tuned and chain-of-thought model QwenChart showed that even on a complex new benchmark such as SciVQA, general models can achieve strong performance with low-cost training, matching the capabilities that LLMs have shown in unimodal downstream tasks. An out-of-domain evaluation showed satisfactory results, albeit with an expected drop in performance.
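The abstract contrasts the two-step extract-then-reason pipeline with querying the chart end to end under a chain-of-thought instruction. As a rough illustration only (not the authors' released code), the sketch below prompts the public Qwen2-VL checkpoint through the Hugging Face transformers API; the model ID, prompt wording, and image path are assumptions, since the QwenChart fine-tune and its exact template are not given here.

```python
# Hedged sketch: end-to-end chart QA with chain-of-thought prompting on a
# generic Qwen vision-language model. "Qwen/Qwen2-VL-7B-Instruct" is a
# stand-in for the (unreleased here) QwenChart fine-tune.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper shipped with Qwen2-VL

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # assumption, see lead-in

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# The chart image and the question go in together; the chain-of-thought
# instruction replaces a separate table/text/code extraction step.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example_chart.png"},  # hypothetical file
            {
                "type": "text",
                "text": (
                    "Reason step by step about the axes, legend and data "
                    "points, then answer: in which year does the blue "
                    "series peak?"
                ),
            },
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the newly generated answer is decoded.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The point of the single-pass setup is that the raw chart pixels never pass through an intermediate table, text, or code representation; the visual encoder and the language model reason jointly.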
2024
THAVQA: A German Task-oriented VQA Dataset Annotated with Human Visual Attention
Moritz Kronberger
|
Viviana Ventura
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Video question answering (VQA) is a challenging task that requires models to generate answers by using information from both text and video. We present Task-oriented Human Attention Video Question Answering (THAVQA), a new VQA dataset consisting of third- and first-person videos of an instructor using a sewing machine. The sewing task is formalized step by step in a script: each step consists of a video annotated with German-language open-ended question and answer (QA) pairs and with human visual attention. The paper also includes a first assessment of the performance of a pre-trained Multimodal Large Language Model (MLLM) in generating answers to the questions of our dataset across different experimental settings. Results show that our task-oriented dataset is challenging for pre-trained models. Specifically, the model struggles to answer questions requiring technical knowledge or spatio-temporal reasoning.
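To make the described annotation structure concrete, here is one plausible record layout for a THAVQA-style script step; the abstract only states that each step pairs a video with German open-ended QA pairs and human visual attention, so all field names and the example values are hypothetical, not the dataset's actual schema.

```python
# Hedged sketch of a per-step annotation record for a THAVQA-like dataset.
# Field names and example values are hypothetical (see lead-in).
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str  # open-ended question, German in the original data
    answer: str

@dataclass
class SewingStep:
    step_id: int
    instruction: str          # the script step this video enacts
    video_third_person: str   # path to the third-person recording
    video_first_person: str   # path to the first-person recording
    qa_pairs: list[QAPair] = field(default_factory=list)
    # Human visual attention, e.g. one fixation per entry:
    # (timestamp in seconds, normalized x, normalized y) on the frame.
    gaze_fixations: list[tuple[float, float, float]] = field(default_factory=list)

# Illustrative instance (contents invented for the example):
step = SewingStep(
    step_id=3,
    instruction="Thread the needle",
    video_third_person="videos/step03_third_person.mp4",
    video_first_person="videos/step03_first_person.mp4",
    qa_pairs=[QAPair("Was macht die Person als Nächstes?",
                     "Sie fädelt den Faden in die Nadel ein.")],
    gaze_fixations=[(12.4, 0.51, 0.38)],
)
print(step.qa_pairs[0].question)
```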