Incorporating Background Knowledge into Video Description Generation
Spencer Whitehead, Heng Ji, Mohit Bansal, Shih-Fu Chang, Clare Voss
Abstract
Most previous efforts toward video captioning focus on generating generic descriptions, such as, “A man is talking.” We collect a news video dataset to generate enriched descriptions that include important background knowledge, such as named entities and related events, which allows the user to fully understand the video content. We develop an approach that uses video meta-data to retrieve topically related news documents for a video and extracts the events and named entities from these documents. Then, given the video as well as the extracted events and entities, we generate a description using a Knowledge-aware Video Description network. The model learns to incorporate entities found in the topically related documents into the description via an entity pointer network and the generation procedure is guided by the event and entity types from the topically related documents through a knowledge gate, which is a gating mechanism added to the model’s decoder that takes a one-hot vector of these types. We evaluate our approach on the new dataset of news videos we have collected, establishing the first benchmark for this dataset as well as proposing a new metric to evaluate these descriptions.
- Anthology ID:
- D18-1433
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month:
- October-November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3992–4001
- URL:
- https://aclanthology.org/D18-1433
- DOI:
- 10.18653/v1/D18-1433
- Cite (ACL):
- Spencer Whitehead, Heng Ji, Mohit Bansal, Shih-Fu Chang, and Clare Voss. 2018. Incorporating Background Knowledge into Video Description Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3992–4001, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Incorporating Background Knowledge into Video Description Generation (Whitehead et al., EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/D18-1433.pdf