Incorporating Background Knowledge into Video Description Generation

Spencer Whitehead, Heng Ji, Mohit Bansal, Shih-Fu Chang, Clare Voss


Abstract
Most previous work on video captioning focuses on generating generic descriptions such as “A man is talking.” We collect a news video dataset to generate enriched descriptions that include important background knowledge, such as named entities and related events, which allows the user to fully understand the video content. We develop an approach that uses video meta-data to retrieve topically related news documents for a video and extracts the events and named entities from these documents. Then, given the video as well as the extracted events and entities, we generate a description using a Knowledge-aware Video Description network. The model learns to incorporate entities found in the topically related documents into the description via an entity pointer network. The generation procedure is guided by the event and entity types from these documents through a knowledge gate, a gating mechanism added to the model’s decoder that takes a one-hot vector of these types. We evaluate our approach on the new dataset of news videos we have collected, establishing the first benchmark for this dataset and proposing a new metric to evaluate these descriptions.
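The two mechanisms named in the abstract, the knowledge gate and the entity pointer network, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the sigmoid gate, the pointer-generator style mixing weight `p_gen`, and all function and parameter names below are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def type_vector(present_types, type_vocab):
    # Indicator vector over a fixed inventory of event and entity types
    # (hypothetical encoding; the abstract only says "one-hot vector").
    return [1.0 if t in present_types else 0.0 for t in type_vocab]


def knowledge_gate(hidden, type_vec, weights, bias):
    # Assumed form: each decoder hidden unit is scaled by a sigmoid gate
    # computed from the type vector.
    gated = []
    for h, row, b in zip(hidden, weights, bias):
        g = sigmoid(sum(w * k for w, k in zip(row, type_vec)) + b)
        gated.append(g * h)
    return gated


def mix_with_entity_pointer(vocab_probs, entity_scores, entities, p_gen):
    # Assumed pointer-generator style mixing: with weight p_gen generate
    # from the vocabulary, otherwise copy an entity from the retrieved
    # documents according to a softmax over its attention score.
    pointer_probs = softmax(entity_scores)
    mixed = {word: p_gen * p for word, p in vocab_probs.items()}
    for entity, p in zip(entities, pointer_probs):
        mixed[entity] = mixed.get(entity, 0.0) + (1.0 - p_gen) * p
    return mixed
```

For example, calling `mix_with_entity_pointer({"a": 0.6, "man": 0.4}, [2.0, 0.5], ["ENTITY_A", "ENTITY_B"], p_gen=0.7)` returns a distribution that still sums to 1 but now assigns probability to the two placeholder entity names, which is how out-of-vocabulary named entities could reach the generated description.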
Anthology ID:
D18-1433
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
3992–4001
URL:
https://aclanthology.org/D18-1433
DOI:
10.18653/v1/D18-1433
Cite (ACL):
Spencer Whitehead, Heng Ji, Mohit Bansal, Shih-Fu Chang, and Clare Voss. 2018. Incorporating Background Knowledge into Video Description Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3992–4001, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Incorporating Background Knowledge into Video Description Generation (Whitehead et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/D18-1433.pdf