TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering

Chuyi Shang; Amos You; Sanjay Subramanian; Trevor Darrell; Roei Herzig

doi:10.18653/v1/2024.emnlp-main.544

Abstract

Recently, image-based Large Multimodal Models (LMMs) have made significant progress in video question-answering (VideoQA) using a frame-wise approach by leveraging large-scale pretraining in a zero-shot manner. Nevertheless, these models need to be capable of finding relevant information, extracting it, and answering the question simultaneously. Currently, existing methods perform all of these steps in a single pass without being able to adapt if insufficient or incorrect information is collected. To overcome this, we introduce a modular multi-LMM agent framework based on several agents with different roles, instructed by a Planner agent that updates its instructions using shared feedback from the other agents. Specifically, we propose TraveLER, a method that can create a plan to "**Trave**rse" through the video, ask questions about individual frames to "**L**ocate" and store key information, and then "**E**valuate" if there is enough information to answer the question. Finally, if there is not enough information, our method is able to "**R**eplan" based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several VideoQA benchmarks without the need to fine-tune on specific datasets. Our code is available at https://github.com/traveler-framework/TraveLER.
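
The Traverse, Locate, Evaluate, Replan cycle described in the abstract can be pictured as a small control loop. The following is a minimal sketch only, not the authors' implementation: the names `run_loop`, `Memory`, `toy_planner`, `toy_locator`, and `toy_evaluator` are hypothetical, and the three agents are plain Python stubs operating on a list of frame captions, standing in for the LMM calls a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Memory:
    """Shared store of what has been learned so far, keyed by frame index."""
    notes: Dict[int, str] = field(default_factory=dict)


# Hypothetical agent signatures (assumptions for illustration only).
Planner = Callable[[str, Memory, Optional[str]], List[int]]
Locator = Callable[[str, int], str]
Evaluator = Callable[[str, Memory], Tuple[Optional[str], Optional[str]]]


def run_loop(question: str, planner: Planner, locator: Locator,
             evaluator: Evaluator, max_rounds: int = 5):
    """Iterate plan -> locate -> evaluate until an answer is found."""
    memory = Memory()
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # Traverse (and Replan on later rounds): choose frames to inspect,
        # taking the evaluator's feedback and the shared memory into account.
        frames = planner(question, memory, feedback)
        # Locate: query each chosen frame and store what was found.
        for idx in frames:
            memory.notes[idx] = locator(question, idx)
        # Evaluate: either return an answer or feedback for the next plan.
        answer, feedback = evaluator(question, memory)
        if answer is not None:
            return answer, memory
    return None, memory


# Toy stand-ins: a "video" is a list of per-frame captions.
CAPTIONS = [
    "a man enters a kitchen",
    "he picks up a cup",
    "he drinks from the cup",
    "he leaves the room",
]


def toy_planner(question, memory, feedback):
    # Visit the next frame that has not been inspected yet.
    unseen = [i for i in range(len(CAPTIONS)) if i not in memory.notes]
    return unseen[:1]


def toy_locator(question, idx):
    return CAPTIONS[idx]


def toy_evaluator(question, memory):
    for note in memory.notes.values():
        if "drinks" in note:
            return "drinks from it", None
    return None, "no inspected frame shows what happens to the cup"


answer, memory = run_loop(
    "What does the man do with the cup?",
    toy_planner, toy_locator, toy_evaluator,
)
```

In this toy run the loop inspects frames 0, 1, and 2 in successive rounds and stops once the evaluator finds enough information, which mirrors the abstract's point that the method can adapt when a single pass collects too little.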

Anthology ID: 2024.emnlp-main.544
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9740–9766
URL: https://aclanthology.org/2024.emnlp-main.544
DOI: 10.18653/v1/2024.emnlp-main.544
Cite (ACL): Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. 2024. TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9740–9766, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering (Shang et al., EMNLP 2024)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.544.pdf
