Diverse and Relevant Visual Storytelling with Scene Graph Embeddings

Xudong Hong, Rakshith Shetty, Asad Sayeed, Khushboo Mehra, Vera Demberg, Bernt Schiele


Abstract
A problem in automatically generated stories for image sequences is that they use overly generic vocabulary and phrase structure and fail to match the distributional characteristics of human-generated text. We address this problem by introducing explicit representations for objects and their relations by extracting scene graphs from the images. Utilizing an embedding of this scene graph enables our model to more explicitly reason over objects and their relations during story generation, compared to the global features from an object classifier used in previous work. We apply metrics that account for the diversity of words and phrases of generated stories as well as for reference to narratively-salient image features and show that our approach outperforms previous systems. Our experiments also indicate that our models obtain competitive results on reference-based metrics.
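To make the approach concrete, here is a minimal sketch of how (subject, predicate, object) triples from a scene graph might be embedded and pooled into a single graph-level vector for a story decoder. This is not the authors' implementation: the PyTorch framing, the class name SceneGraphEncoder, the dimensions, and the toy vocabulary are all illustrative assumptions.

```python
# Illustrative sketch only -- not the code from Hong et al. (2020).
# Assumes scene graphs are given as (subject, predicate, object) index triples.
import torch
import torch.nn as nn


class SceneGraphEncoder(nn.Module):
    """Embeds (subject, predicate, object) triples and mean-pools them
    into one graph-level vector (a hypothetical design, for illustration)."""

    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Fuse the three role embeddings of each triple into one vector.
        self.triple_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, triples: torch.Tensor) -> torch.Tensor:
        # triples: (num_triples, 3) integer indices for subject/predicate/object
        e = self.embed(triples)                      # (T, 3, dim)
        t = self.triple_mlp(e.flatten(1))            # (T, dim) per-triple vectors
        return t.mean(dim=0)                         # (dim,) graph embedding


# Toy usage: a one-triple graph for "dog on beach"; indices are made up.
vocab = {"dog": 0, "on": 1, "beach": 2}
triples = torch.tensor([[vocab["dog"], vocab["on"], vocab["beach"]]])
enc = SceneGraphEncoder(vocab_size=len(vocab))
graph_vec = enc(triples)
print(graph_vec.shape)  # torch.Size([128])
```

In the paper's setting, such a graph embedding would complement or replace the global object-classifier features used in prior work as conditioning input to the story generator; the pooling and fusion choices above are one simple possibility among many.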
Anthology ID: 2020.conll-1.34
Volume: Proceedings of the 24th Conference on Computational Natural Language Learning
Month: November
Year: 2020
Address: Online
Venue: CoNLL
SIG: SIGNLL
Publisher: Association for Computational Linguistics
Pages: 420–430
URL: https://aclanthology.org/2020.conll-1.34
DOI: 10.18653/v1/2020.conll-1.34
Cite (ACL): Xudong Hong, Rakshith Shetty, Asad Sayeed, Khushboo Mehra, Vera Demberg, and Bernt Schiele. 2020. Diverse and Relevant Visual Storytelling with Scene Graph Embeddings. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 420–430, Online. Association for Computational Linguistics.
Cite (Informal): Diverse and Relevant Visual Storytelling with Scene Graph Embeddings (Hong et al., CoNLL 2020)
PDF: https://aclanthology.org/2020.conll-1.34.pdf
Data: VIST