EMCompress: Video-LLMs with Endomorphic Multimodal Compression

Zheyu Fan; Jiateng Liu; Yuji Zhang; Zihan Wang; Yi R. Fung; Manling Li; Heng Ji

Abstract

Video-LLMs face a fundamental tension in long-video reasoning: static, sparse frame sampling either dilutes evidence across task-irrelevant segments at significant cost or misses fine-grained temporal semantics altogether. We propose a novel, cognitively-inspired task — Endomorphic Multimodal Compression (EMC) — as a structurally-constrained sufficient-statistic problem for VideoQA, and formulate it as an endomorphic transformation F_EMC : (V, Q) → (v, q) that compresses the multimodal input while preserving answer invariance across reasonable downstream models. The endomorphic form keeps the compressed output in the downstream pipeline’s native task space — a structural mirror of the filter-then-reason mechanism in the cognitive literature motivating EMC — distinguishing it from latent-code compression (IB / VIB) and making the formulation extensible to other multimodal settings. Under the Markov chain A → (V, Q) → (v, q), EMC realizes the classical sufficiency condition I((v, q); A) = I((V, Q); A) in its VideoQA-natural form. As a modular front-end, EMC plugs into both Video Instruction Tuning and Video Question Answering pipelines. We release the first dedicated benchmark and propose ReSimplifyIt, an EMC baseline surpassing prior methods by 0.40 F-1 with competitive query rewriting. Integrating EMC yields relative gains of 7.33% in training and 33.7% in inference for video-language understanding.
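
The sufficiency condition above can be read as the equality case of the data processing inequality. The sketch below uses only the standard chain rule for mutual information and the Markov chain stated in the abstract; the conditional-independence restatement in the last line is a textbook equivalence and is not taken from the paper itself.

```latex
% Assumption: A \to (V,Q) \to (v,q) is a Markov chain, as stated in the abstract.
\begin{align*}
I\bigl(A;(V,Q),(v,q)\bigr)
  &= I\bigl(A;(V,Q)\bigr) + \underbrace{I\bigl(A;(v,q)\mid (V,Q)\bigr)}_{=\,0 \text{ by the Markov chain}} \\
  &= I\bigl(A;(v,q)\bigr) + I\bigl(A;(V,Q)\mid (v,q)\bigr).
\end{align*}
% Equating the two expansions gives the data processing inequality:
\[
I\bigl((V,Q);A\bigr) - I\bigl((v,q);A\bigr)
  = I\bigl(A;(V,Q)\mid (v,q)\bigr) \;\ge\; 0 .
\]
% Sufficiency, I((v,q);A) = I((V,Q);A), therefore holds exactly when
\[
I\bigl(A;(V,Q)\mid (v,q)\bigr) = 0
\quad\Longleftrightarrow\quad
A \perp (V,Q) \mid (v,q).
\]
```

In words: compression can never add information about the answer, and it loses none exactly when the answer is conditionally independent of the original video and query given their compressed forms.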

Anthology ID: 2026.findings-acl.8
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 137–162
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.8/
Cite (ACL): Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, and Heng Ji. 2026. EMCompress: Video-LLMs with Endomorphic Multimodal Compression. In Findings of the Association for Computational Linguistics: ACL 2026, pages 137–162, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): EMCompress: Video-LLMs with Endomorphic Multimodal Compression (Fan et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.8.pdf
Checklist: 2026.findings-acl.8.checklist.pdf
