Yiming Lei

2026

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success in understanding static pre-recorded video scenarios (e.g., event-centric or narrative-driven content). However, existing MLLMs are largely trained on datasets restricted to static content due to the scarcity of high-quality interleaved data, causing them to struggle with dynamic interactions. Distinct from pre-recorded videos, live streaming is characterized by high-density, interleaved multimodal turns, where viewer comments (danmaku) are tightly coupled with real-time audio-visual evidence and evolving dialogue context. In such settings, purely textual annotations fail to capture fine-grained visual and temporal dependencies. To bridge this gap, we introduce **Live-Aid**, the first large-scale Chinese dataset of interleaved live interactions with **human-annotated**, temporally aligned video responses, spanning over **1,100 hours** and 80,037 dialogue turns across 8,053 video sessions. Building on this, we leverage these high-quality annotations within a novel multi-agent pipeline to construct evaluation tasks targeting core capabilities of live interactions. Extensive evaluations of strong Video-LLMs and Omni-LLMs reveal critical limitations in interleaved multi-turn interactions requiring temporal reasoning, highlighting the value of **Live-Aid** in advancing interleaved multimodal reasoning and the modeling of dynamic audio-visual dependencies.

2025

Video-based dialogue systems have compelling application value, for example as education assistants, and are therefore attracting growing interest. However, current video-based dialogue systems are limited by their reliance on a single dialogue type, which hinders their versatility across practical scenarios such as question answering and emotional dialogue. In this paper, we frame this challenge as how to generate video-driven multilingual mixed-type dialogues. To address it, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still cannot perform well in this setting, even with the help of in-context learning and fine-tuning, which indicates that the task is non-trivial and needs further research.

***Video Comment Art*** enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g., mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce **GODBench**, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs’ abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose **Ripple of Thought (RoT)**, a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments on GODBench reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improving creative composition, highlighting its potential to drive meaningful advancements in MLLM-based creativity.

Co-authors

Hui Qiu 1

Venues

Findings 2
ACL 1
