Video Dialog via Progressive Inference and Cross-Transformer

Weike Jin, Zhou Zhao, Mao Gu, Jun Xiao, Furu Wei, Yueting Zhuang


Abstract
Video dialog is a new and challenging task, which requires the agent to answer questions combining video information with dialog history. And different from single-turn video question answering, the additional dialog history is important for video dialog, which often includes contextual information for the question. Existing visual dialog methods mainly use RNN to encode the dialog history as a single vector representation, which might be rough and straightforward. Some more advanced methods utilize hierarchical structure, attention and memory mechanisms, which still lack an explicit reasoning process. In this paper, we introduce a novel progressive inference mechanism for video dialog, which progressively updates query information based on dialog history and video content until the agent think the information is sufficient and unambiguous. In order to tackle the multi-modal fusion problem, we propose a cross-transformer module, which could learn more fine-grained and comprehensive interactions both inside and between the modalities. And besides answer generation, we also consider question generation, which is more challenging but significant for a complete video dialog system. We evaluate our method on two large-scale datasets, and the extensive experiments show the effectiveness of our method.
Anthology ID:
D19-1217
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month:
November
Year:
2019
Address:
Hong Kong, China
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Note:
Pages:
2109–2118
Language:
URL:
https://aclanthology.org/D19-1217
DOI:
10.18653/v1/D19-1217
Bibkey:
Cite (ACL):
Weike Jin, Zhou Zhao, Mao Gu, Jun Xiao, Furu Wei, and Yueting Zhuang. 2019. Video Dialog via Progressive Inference and Cross-Transformer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2109–2118, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Video Dialog via Progressive Inference and Cross-Transformer (Jin et al., EMNLP-IJCNLP 2019)
Copy Citation:
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/D19-1217.pdf