Bringing Real-World Relations into Video Generation with Graph-Structured Knowledge

Joonhyung Park, Jaeyun Song, Sihwan Park, Eunho Yang


Abstract
Recent proprietary video generation models have demonstrated remarkable proficiency in synthesizing highly realistic videos from textual instructions. Most open-source text-to-video models, however, still struggle to accurately simulate real-world physics and dynamic entity interactions. Existing approaches rely on scaling laws and large-scale, high-quality video datasets to implicitly learn physical dynamics, yet this paradigm is constrained by prohibitive costs and the burdensome demands of data curation. Motivated by this, we propose a novel framework that integrates graph-structured temporal knowledge into video latent diffusion models to enhance compositional generation and interaction fidelity. Our framework constructs video scene graphs specifically designed to capture entity relationships, temporal dynamics, and global scene context. These graph-structured representations guide the generation process through cross-attention mechanisms. Additionally, we introduce Graph-Aligned Denoising Loss (GADL), a training objective that ensures adherence to conditioned graphs by incorporating node modification tasks within the denoising process, leveraging synchronized edited video-graph pairs. Comprehensive evaluations demonstrate that incorporating graph-structured knowledge significantly enhances compositionality and the accurate portrayal of real-world interactions in generated videos.
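As a rough illustration of the kind of structure the abstract describes, the sketch below models a video scene graph with entity nodes, time-stamped relations, and a global scene context, plus a node modification that yields an edited graph. This is only a minimal sketch under stated assumptions: the class names, fields, frame-interval encoding, and the example scene are hypothetical and not taken from the paper, and the code covers neither the cross-attention conditioning nor GADL itself.

```python
from dataclasses import dataclass, field, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    node_id: int
    label: str  # hypothetical: a plain-text entity name


@dataclass(frozen=True)
class Relation:
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int  # hypothetical temporal encoding: inclusive frame interval
    end_frame: int


@dataclass
class VideoSceneGraph:
    scene_context: str  # global scene description
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def active_relations(self, frame: int) -> List[Relation]:
        """Relations whose frame interval contains `frame`."""
        return [r for r in self.relations
                if r.start_frame <= frame <= r.end_frame]

    def triples(self, frame: int) -> List[Tuple[str, str, str]]:
        """(subject, predicate, object) labels active at `frame`."""
        labels = {e.node_id: e.label for e in self.entities}
        return [(labels[r.subject_id], r.predicate, labels[r.object_id])
                for r in self.active_relations(frame)]

    def with_node_label(self, node_id: int, new_label: str) -> "VideoSceneGraph":
        """Return an edited copy with one node relabeled (original untouched)."""
        if all(e.node_id != node_id for e in self.entities):
            raise KeyError(node_id)
        entities = [replace(e, label=new_label) if e.node_id == node_id else e
                    for e in self.entities]
        return VideoSceneGraph(self.scene_context, entities, list(self.relations))


# Hypothetical example scene, invented for illustration only.
graph = VideoSceneGraph(
    scene_context="kitchen",
    entities=[Entity(0, "person"), Entity(1, "cup"), Entity(2, "table")],
    relations=[
        Relation(0, "picks up", 1, 0, 15),
        Relation(1, "on", 2, 0, 7),
        Relation(0, "drinks from", 1, 16, 31),
    ],
)

# A node modification: the same relations, with one entity swapped.
edited = graph.with_node_label(1, "bottle")
```

Here `graph` and `edited` differ only in one node label. A pair of this shape is one plausible reading of what a graph edit looks like; the corresponding edited video would have to come from elsewhere.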
Anthology ID:
2026.acl-long.172
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3756–3771
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.172/
Cite (ACL):
Joonhyung Park, Jaeyun Song, Sihwan Park, and Eunho Yang. 2026. Bringing Real-World Relations into Video Generation with Graph-Structured Knowledge. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3756–3771, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Bringing Real-World Relations into Video Generation with Graph-Structured Knowledge (Park et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.172.pdf
Checklist:
 2026.acl-long.172.checklist.pdf