Qingpei Guo


2025

VQAGuider: Guiding Multimodal Large Language Models to Answer Complex Video Questions
Yuyan Chen | Jiyuan Jia | Jiaxin Lu | Siyue Li | Yu Guan | Ming Yang | Qingpei Guo
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Complex video question-answering (VQA) requires in-depth understanding of video content, including object and action recognition as well as video classification and summarization, and exhibits great potential in emerging applications such as education and entertainment. Multimodal large language models (MLLMs) may accomplish this task by grasping the intention of a question and decomposing it into a series of visual recognition sub-tasks to find the answer with the help of an agent. To tackle this task, we first collect a new dedicated Complex VQA dataset named CVQA and then propose VQAGuider, an innovative framework that plans a few atomic visual recognition tools through video-related API matching. VQAGuider enables MLLMs to engage deeply with video content and answer complex video-related questions precisely, going beyond the alignment of visual and language features used in simple VQA tasks. Our experiments demonstrate that VQAGuider enables MLLMs to navigate complex VQA tasks, improving accuracy by 29.6% and 17.2% on CVQA and existing VQA datasets, respectively, and highlighting its potential for advancing MLLMs' capabilities in video understanding.
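The abstract outlines the planning idea but not its interfaces, so the sketch below is only a minimal illustration of decomposing a complex question into atomic tool calls matched against a catalog of video-related APIs. Every name in it (TOOL_CATALOG, ToolCall, plan_tools, and the tool strings) is a hypothetical placeholder, and the keyword heuristic merely stands in for the MLLM planner the paper describes.

    from dataclasses import dataclass

    # Hypothetical catalog of atomic visual recognition tools, keyed by the
    # kind of sub-task they solve (names are illustrative, not from the paper).
    TOOL_CATALOG = {
        "object": "detect_objects",        # e.g. an object detection API
        "action": "recognize_actions",     # e.g. an action recognition API
        "classify": "classify_video",      # e.g. a video classification API
        "summarize": "summarize_video",    # e.g. a video summarization API
    }

    @dataclass
    class ToolCall:
        tool: str   # name of the matched video-related API
        query: str  # the sub-question this atomic tool should answer

    def plan_tools(question: str) -> list[ToolCall]:
        """Naive keyword-based stand-in for the MLLM planner: decompose a
        complex question into atomic sub-tasks and match each to an API."""
        plan = []
        lowered = question.lower()
        if any(w in lowered for w in ("what object", "what is the", "who")):
            plan.append(ToolCall(TOOL_CATALOG["object"], question))
        if any(w in lowered for w in ("doing", "action", "happen")):
            plan.append(ToolCall(TOOL_CATALOG["action"], question))
        if not plan:  # fall back to whole-video summarization
            plan.append(ToolCall(TOOL_CATALOG["summarize"], question))
        return plan

    # Example: a complex question yields two atomic tool calls.
    for call in plan_tools("Who is in the kitchen and what are they doing?"):
        print(call.tool, "->", call.query)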

2024

HOTVCOM: Generating Buzzworthy Comments for Videos
Yuyan Chen | Songzhou Yan | Qingpei Guo | Jiyuan Jia | Zhixu Li | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2024

In the era of social media video platforms, popular “hot-comments” play a crucial role in attracting user attention to short-form videos, making them vital for marketing and branding purposes. However, existing research predominantly focuses on generating descriptive comments or “danmaku” in English, which offer immediate reactions to specific video moments. Addressing this gap, our study introduces HOTVCOM, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the ComHeat framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.
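The abstract describes ComHeat only as synergistically integrating visual, auditory, and textual data, so the sketch below assumes one common realization of such integration, a late-fusion prefix for a text decoder, purely for illustration. The module name, feature dimensions, and projection scheme are assumptions, not the paper's actual design.

    import torch
    import torch.nn as nn

    class TriModalFusion(nn.Module):
        """Illustrative late-fusion module: project visual, auditory, and
        textual features into a shared space and concatenate them as a
        prefix for a text decoder. Dimensions are arbitrary placeholders."""
        def __init__(self, d_visual=768, d_audio=512, d_text=768, d_model=768):
            super().__init__()
            self.proj_v = nn.Linear(d_visual, d_model)
            self.proj_a = nn.Linear(d_audio, d_model)
            self.proj_t = nn.Linear(d_text, d_model)

        def forward(self, v, a, t):
            # Each input: (batch, seq_len_modality, d_modality)
            fused = torch.cat(
                [self.proj_v(v), self.proj_a(a), self.proj_t(t)], dim=1
            )
            return fused  # (batch, total_seq_len, d_model) decoder prefix

    # Example with random features standing in for real modality encoders.
    fusion = TriModalFusion()
    v = torch.randn(2, 16, 768)   # 16 video frame features
    a = torch.randn(2, 10, 512)   # 10 audio segment features
    t = torch.randn(2, 32, 768)   # 32 title/caption token features
    print(fusion(v, a, t).shape)  # torch.Size([2, 58, 768])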