Jian Ding
2025
InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
Kirolos Ataallah | Eslam Mohamed Bakr | Mahmoud Ahmed | Chenhui Gou | Khushbu Pahwa | Jian Ding | Mohamed Elhoseiny
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Kirolos Ataallah | Eslam Mohamed Bakr | Mahmoud Ahmed | Chenhui Gou | Khushbu Pahwa | Jian Ding | Mohamed Elhoseiny
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. Therefore, we introduce InfiniBench, a comprehensive benchmark designed to evaluate the capabilities of models in long video understanding rigorously.InfiniBench offers:(1) Over 1,000 hours of video content, with an average video length of 53 minutes.(2) The largest set of question-answer pairs for long video comprehension, totaling around 87.7 K.(3) Eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context understanding, multi-event linking).(4) Rich annotation formats, including both multiple-choice and open-ended questions.We conducted an in-depth evaluation across both commercial (GPT-4o, Gemini 2.0 Flash) and most recent open-source vision-language models, such as Qwen2.5-VL, InternVL3.0). Results reveal that:(1) Models struggle across the board: Even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance.(2) Strong reliance on world knowledge: Models achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding.(3) Multi-Modal Importance: When provided with full video and subtitle context, however, models show substantial improvements, confirming the critical role of multimodal input in video understanding.Our findings underscore the inherent challenges in long-video comprehension and point to the need for substantial advancements in both grounding and reasoning capabilities in MLLMs.
Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation
Bo Zhang | Hui Ma | Dailin Li | Jian Ding | Jian Wang | Bo Xu | HongFei Lin
Transactions of the Association for Computational Linguistics, Volume 13
Bo Zhang | Hui Ma | Dailin Li | Jian Ding | Jian Wang | Bo Xu | HongFei Lin
Transactions of the Association for Computational Linguistics, Volume 13
Large language models (LLMs) demonstrate remarkable text comprehension and generation capabilities but often lack the ability to utilize up-to-date or domain-specific knowledge not included in their training data. To address this gap, we introduce KEDiT, an efficient method for fine-tuning LLMs for knowledge-grounded dialogue generation. KEDiT operates in two main phases. First, it employs an information bottleneck to compress retrieved knowledge into learnable parameters, retaining essential information while minimizing computational overhead. Second, a lightweight knowledge-aware adapter integrates these compressed knowledge vectors into the LLM during fine-tuning, updating less than 2% of the model parameters. The experimental results on the Wizard of Wikipedia and a newly constructed PubMed-Dialog dataset demonstrate that KEDiT excels in generating contextually relevant and informative responses, outperforming competitive baselines in automatic, LLM-based, and human evaluations. This approach effectively combines the strengths of pretrained LLMs with the adaptability needed for incorporating dynamic knowledge, presenting a scalable solution for fields such as medicine.1