MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

Jie Zhang; Qilang Ye; Hao Zhou; Haochen Liang; Fei Luo

Abstract

The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and from the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce **MAVIS**, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first addresses the granularity mismatch by parsing raw videos into a **Structured Semantic Library**, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks and dispatches specialized agents to independently nominate candidates. Crucially, MAVIS employs a **Logic-aware Debate** mechanism with a strict veto protocol, in which agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow avoids the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
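The workflow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the inverted index, the `vote` rule (support, veto, or abstain), the definition of "controversial" as "no veto but at least one abstention", and all names and sample data below are assumptions made purely for illustration. In particular, the planner is replaced by a hand-written list of (attribute, value) sub-tasks.

```python
from collections import defaultdict


def build_index(library):
    """Assumed stand-in for attribute-level indexing: (attribute, value) -> video ids."""
    index = defaultdict(set)
    for video_id, attributes in library.items():
        for attribute, values in attributes.items():
            for value in values:
                index[(attribute, value)].add(video_id)
    return index


def vote(attributes, attribute, value):
    """Hypothetical agent vote on one sub-task for one candidate video."""
    values = attributes.get(attribute)
    if values is None:
        return "abstain"  # the library records nothing for this attribute
    return "support" if value in values else "veto"


def retrieve(library, index, sub_tasks):
    # Each agent nominates candidates through an index lookup, so only
    # nominated videos are examined, never the whole library.
    nominated = set()
    for task in sub_tasks:
        nominated |= index.get(task, set())

    agreed, controversial = [], []
    for video_id in sorted(nominated):
        votes = [vote(library[video_id], a, v) for a, v in sub_tasks]
        if "veto" in votes:
            continue  # strict veto: a single explicit mismatch prunes the candidate
        (controversial if "abstain" in votes else agreed).append(video_id)
    return agreed, controversial


# Made-up sample library; attribute names and values are illustrative only.
library = {
    "v1": {"object": {"dog"}, "action": {"running"}, "scene": {"beach"}},
    "v2": {"object": {"dog"}, "action": {"sleeping"}, "scene": {"beach"}},
    "v3": {"object": {"dog"}, "action": {"running"}},
    "v4": {"object": {"cat"}, "scene": {"kitchen"}},
}
index = build_index(library)
# Hand-written stand-in for the planner's output on "a dog running on a beach".
sub_tasks = [("object", "dog"), ("action", "running"), ("scene", "beach")]
agreed, controversial = retrieve(library, index, sub_tasks)
print(agreed, controversial)
```

In this toy setting, `v2` is pruned by a veto on the action sub-task, `v4` is never nominated, and `v3` remains as a candidate that would need finer-grained verification because its scene is unknown.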

Anthology ID: 2026.findings-acl.1094
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 21751–21764
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1094/
Cite (ACL): Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, and Fei Luo. 2026. MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 21751–21764, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding (Zhang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1094.pdf
Checklist: 2026.findings-acl.1094.checklist.pdf
