Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

Daeun Lee; Jaehong Yoon; Jaemin Cho; Mohit Bansal

Abstract

Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text–video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly generated regions that should be preserved rather than regenerated. Building on this observation, VideoRepair applies a novel region-preserving refinement strategy with three stages: (i) misalignment detection, where MLLM-based evaluation with automatically generated evaluation questions identifies misaligned regions; (ii) refinement planning, which preserves correctly generated entities, segments their regions across frames, and constructs targeted prompts for misaligned areas; and (iii) localized refinement, which selectively regenerates problematic regions while preserving faithful content through joint optimization of preserved and newly generated areas. On two benchmarks, EvalCrafter and T2V-CompBench, with four recent T2V backbones, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.
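The three stages described in the abstract can be pictured as a simple control flow. The sketch below is illustrative only and is not the authors' implementation: every name in it (`RefinementPlan`, `detect_misalignment`, `plan_refinement`, `localized_refinement`) is hypothetical, the MLLM judge is replaced by a caller-supplied function, frames are toy lists of numbers, and the joint optimization of stage (iii) is replaced by plain mask compositing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RefinementPlan:
    keep: List[str]        # entities judged correct; their regions are preserved
    regenerate: List[str]  # entities judged misaligned; their regions are redone
    local_prompt: str      # targeted prompt covering only the misaligned entities


def detect_misalignment(
    questions: Dict[str, str], answer: Callable[[str], bool]
) -> Dict[str, bool]:
    """Stage (i): ask one evaluation question per entity; True means aligned.

    `answer` stands in for an MLLM judge (assumption: yes/no answers).
    """
    return {entity: answer(question) for entity, question in questions.items()}


def plan_refinement(verdicts: Dict[str, bool]) -> RefinementPlan:
    """Stage (ii): split entities into preserved and regenerated sets."""
    keep = [entity for entity, ok in verdicts.items() if ok]
    regenerate = [entity for entity, ok in verdicts.items() if not ok]
    return RefinementPlan(keep, regenerate, ", ".join(regenerate))


def localized_refinement(
    frames: List[List[int]],
    keep_masks: List[List[bool]],
    new_frames: List[List[int]],
) -> List[List[int]]:
    """Stage (iii), simplified: keep masked pixels, take the rest from new_frames.

    This is hard compositing, a stand-in for the joint optimization
    mentioned in the abstract.
    """
    return [
        [old if keep else new for old, keep, new in zip(frame, mask, fresh)]
        for frame, mask, fresh in zip(frames, keep_masks, new_frames)
    ]


if __name__ == "__main__":
    questions = {"cat": "Is there a cat?", "red ball": "Is the ball red?"}
    # Toy judge: pretend only the cat was generated correctly.
    verdicts = detect_misalignment(questions, lambda q: "cat" in q)
    plan = plan_refinement(verdicts)
    print(plan)
    print(localized_refinement([[1, 2, 3]], [[True, False, True]], [[9, 9, 9]]))
```

In this toy setting the preserved mask plays the role of the segmented regions of correctly generated entities, and `local_prompt` plays the role of the targeted prompt for the misaligned areas.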

Anthology ID: 2026.findings-acl.1817
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 36464–36489
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1817/
Cite (ACL): Daeun Lee, Jaehong Yoon, Jaemin Cho, and Mohit Bansal. 2026. Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement. In Findings of the Association for Computational Linguistics: ACL 2026, pages 36464–36489, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement (Lee et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1817.pdf
Checklist: 2026.findings-acl.1817.checklist.pdf
