Bernt Schiele


2020

pdf
Diverse and Relevant Visual Storytelling with Scene Graph Embeddings
Xudong Hong | Rakshith Shetty | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Proceedings of the 24th Conference on Computational Natural Language Learning

A problem in automatically generated stories for image sequences is that they use overly generic vocabulary and phrase structure and fail to match the distributional characteristics of human-generated text. We address this problem by introducing explicit representations for objects and their relations by extracting scene graphs from the images. Utilizing an embedding of this scene graph enables our model to more explicitly reason over objects and their relations during story generation, compared to the global features from an object classifier used in previous work. We apply metrics that account for the diversity of words and phrases of generated stories as well as for reference to narratively-salient image features and show that our approach outperforms previous systems. Our experiments also indicate that our models obtain competitive results on reference-based metrics.

2018

pdf
A vision-grounded dataset for predicting typical locations for verbs
Nelson Mukuze | Anna Rohrbach | Vera Demberg | Bernt Schiele
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2013

pdf
Grounding Action Descriptions in Videos
Michaela Regneri | Marcus Rohrbach | Dominikus Wetzel | Stefan Thater | Bernt Schiele | Manfred Pinkal
Transactions of the Association for Computational Linguistics, Volume 1

Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general purpose corpus that aligns high quality videos with multiple natural language descriptions of the actions portrayed in the videos, together with an annotation of how similar the action descriptions are to each other. Experimental results demonstrate that a text-based model of similarity between actions improves substantially when combined with visual information from videos depicting the described actions.