UNIVID: Unified Vision-Language Model for Video Moderation

Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng, Kaili Zhao, Yang Xiao, Hanzhong Liang, Kenan Xiao


Abstract
Global-scale video moderation faces a dual challenge: the need for fine-grained multimodal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a Unified Vision-Language model for Video Moderation. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and reuse across multiple tasks. Because existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that achieves relative reductions of 42.7% in violation leakage and 37.0% in overkill rate. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we reclaim extensive computational resources while significantly reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business use cases.
Anthology ID:
2026.acl-industry.32
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Yunyao Li, Georg Rehm, Mei Tu
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
467–479
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-industry.32/
Cite (ACL):
Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng, Kaili Zhao, Yang Xiao, Hanzhong Liang, and Kenan Xiao. 2026. UNIVID: Unified Vision-Language Model for Video Moderation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 467–479, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
UNIVID: Unified Vision-Language Model for Video Moderation (Yang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-industry.32.pdf