Hanoona Rasheed


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2024

pdf bib
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz | Hanoona Rasheed | Salman Khan | Fahad Khan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts for image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT acquired via manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze the strengths and weaknesses of video-based dialogue models. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.