MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo


Abstract
The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce **MAVIS**, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a **Structured Semantic Library**, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a **Logic-aware Debate** mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
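The workflow described in the abstract (attribute-level index, per-sub-task nomination, veto-based pruning) can be pictured with a toy sketch. This is not the authors' implementation: the attribute fields, the hand-written plan standing in for the planner, the rule-based "agents", and the reading of "controversial" as "survived every veto but was not nominated by every agent" are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy stand-in for a structured library: each video maps to attribute sets.
# Field names and values are hypothetical, not taken from the paper.
LIBRARY = {
    "v1": {"object": {"dog"}, "action": {"running"}, "scene": {"beach"}},
    "v2": {"object": {"dog"}, "action": {"sleeping"}, "scene": {"sofa"}},
    "v3": {"object": {"cat"}, "action": {"running"}, "scene": {"beach"}},
    "v4": {"object": {"dog"}, "action": {"running"}, "scene": set()},  # scene unknown
}


def build_index(library):
    """Inverted index: (field, value) -> ids of videos carrying that attribute."""
    index = defaultdict(set)
    for vid, attrs in library.items():
        for field, values in attrs.items():
            for value in values:
                index[(field, value)].add(vid)
    return index


def nominate(index, field, value):
    """One 'agent' nominates candidates for its sub-task via an index lookup."""
    return set(index.get((field, value), set()))


def vetoes(library, vid, field, value):
    """Veto only on an explicit contradiction; a missing attribute is not one."""
    known = library[vid].get(field, set())
    return bool(known) and value not in known


def retrieve(library, index, plan):
    """Return (agreed, controversial) candidate ids for a plan of sub-tasks."""
    nominations = {f: nominate(index, f, v) for f, v in plan.items()}
    candidates = set().union(*nominations.values())
    # Strict veto: a single contradicting sub-task removes the candidate.
    survivors = {
        vid for vid in candidates
        if not any(vetoes(library, vid, f, v) for f, v in plan.items())
    }
    agreed = sorted(v for v in survivors if all(v in n for n in nominations.values()))
    controversial = sorted(survivors - set(agreed))
    return agreed, controversial


# Hand-written plan standing in for the planner's decomposition of
# a query such as "a dog running on the beach".
PLAN = {"object": "dog", "action": "running", "scene": "beach"}
INDEX = build_index(LIBRARY)
agreed, controversial = retrieve(LIBRARY, INDEX, PLAN)
print(agreed, controversial)
```

In this toy setup only nominated videos are ever inspected, so the whole library is never scanned, and the one candidate whose scene is unknown is the kind of case that would be passed on to a finer-grained check.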
Anthology ID:
2026.findings-acl.1094
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
21751–21764
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1094/
Cite (ACL):
Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, and Fei Luo. 2026. MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 21751–21764, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding (Zhang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1094.pdf
Checklist:
 2026.findings-acl.1094.checklist.pdf