@inproceedings{yoon-etal-2025-raccoon,
title = "{RACC}oo{N}: Versatile Instructional Video Editing with Auto-Generated Narratives",
author = "Yoon, Jaehong and
Yu, Shoubin and
Bansal, Mohit",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1420/",
pages = "27960--27996",
ISBN = "979-8-89176-332-6",
abstract = "Recent video generative models primarily rely on detailed, labor-intensive text prompts for tasks, like inpainting or style editing, limiting adaptability for personal/raw videos. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video editing method, supporting diverse video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P), which automatically generates structured video descriptions capturing both scene context and object details, and Paragraph-to-Video (P2V), where users (optionally) refine these descriptions to guide a video diffusion model for flexible content modifications, including removing, changing subjects, and/or adding new objects. Key contributions of RACCooN include: (1) A multi-granular spatiotemporal pooling strategy for structured video understanding, capturing both broad context and fine-grained details of major objects to enable precise text-based video editing without the need for complex human annotations. (2) A video generative model fine-tuned on our curated video-paragraph-mask dataset, enhances the editing and inpainting quality. (3) The capability to seamlessly generate new objects in videos by forecasting their movements through automatically generated mask planning. In the end, users can easily edit complex videos with RACCooN{'}s automatic explanations and guidance. We demonstrate its versatile capabilities in video-to-paragraph generation (up to 9.4{\%}p absolute improvement in human evaluations) and video content editing (relative to 49.7{\%} lower FVD), and can be integrated with SoTA video generation models for further enhancement."
}