Mingchao Liu
2026
IPS: In-Prompt Process Supervision for Short Video Content Moderation
Mingchao Liu | Yu Sun | Ruixiao Sun | Xin Dong | Xiang Shen | Hongwei Wang | Hongyu Xiong | Yang Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Multimodal large language models (MLLMs) are effective at capturing the semantics of short video content; however, they often fail to attend to the policy-specific details required for reliable content moderation. To address this limitation, we introduce IPS, a novel framework that integrates In-Prompt Process Supervision into MLLMs by adding sequential reasoning over ancillary questions during fine-tuning. IPS consistently outperforms baseline MLLMs on public and proprietary benchmarks. Moreover, replacing human-annotated ancillary labels with MLLM-generated ones results in only marginal performance degradation, demonstrating robustness to noisy supervision and strong scalability with model-generated annotations. These findings establish IPS as a scalable and effective solution for complex multimodal classification in large-scale industrial settings.
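To make "sequential reasoning over ancillary questions" more concrete, the following is a minimal sketch of how one fine-tuning example might be assembled. It assumes the model is asked a fixed list of ancillary questions in order and is trained to answer each one before emitting the final moderation label. The `ModerationSample` fields, the prompt template, and the `build_ips_example` helper are hypothetical illustrations, not the IPS implementation; a plain text description stands in for the actual video input.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ModerationSample:
    # Hypothetical stand-in for the video input (real inputs would be frames/audio).
    video_description: str
    # Ordered (question, answer) pairs; answers may be human- or MLLM-annotated.
    ancillary: List[Tuple[str, str]]
    # Final policy decision, e.g. "violating" or "non-violating".
    final_label: str


def build_ips_example(sample: ModerationSample) -> Dict[str, str]:
    """Assemble a prompt/target pair whose target answers each ancillary
    question in order before stating the final label (hypothetical format)."""
    questions = "\n".join(
        f"Q{i + 1}: {q}" for i, (q, _) in enumerate(sample.ancillary)
    )
    prompt = (
        f"Video: {sample.video_description}\n"
        "Answer the following questions in order, then give the final label.\n"
        f"{questions}"
    )
    # The supervised target contains the intermediate answers (the "process")
    # followed by the final decision.
    answers = "\n".join(
        f"A{i + 1}: {a}" for i, (_, a) in enumerate(sample.ancillary)
    )
    target = f"{answers}\nFinal label: {sample.final_label}"
    return {"prompt": prompt, "target": target}


if __name__ == "__main__":
    demo = ModerationSample(
        video_description="a short clip of a kitchen knife demonstration",
        ancillary=[
            ("Is a weapon visible?", "yes"),
            ("Is it used to threaten a person?", "no"),
        ],
        final_label="non-violating",
    )
    example = build_ips_example(demo)
    print(example["prompt"])
    print(example["target"])
```

Placing the ancillary answers in the target, rather than only in the prompt, is one way to make the intermediate reasoning itself a supervised signal; swapping in MLLM-generated answers would only change the `ancillary` data, not the template.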