@inproceedings{zhang-etal-2024-omagent,
title = "{O}m{A}gent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer",
author = "Zhang, Lu and
Zhao, Tiancheng and
Ying, Heting and
Ma, Yibo and
Lee, Kyusong",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/2024.emnlp-main.559/",
doi = "10.18653/v1/2024.emnlp-main.559",
pages = "10031--10045",
abstract = "Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features an Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent{'}s efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks."
}