History for Visual Dialog: Do we really need it?
Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, Verena Rieser
Abstract
Visual Dialogue involves “understanding” the dialogue history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to accurately generate the correct response. In this paper, we show that co-attention models which explicitly encode dialogue history outperform models that don’t, achieving state-of-the-art performance (72% NDCG on val set). However, we also expose shortcomings of the crowdsourcing dataset collection procedure, by showing that dialogue history is indeed only required for a small amount of the data, and that the current evaluation metric encourages generic replies. To that end, we propose a challenging subset (VisDialConv) of the VisDial val set and provide a benchmark of 63% NDCG.
- Anthology ID:
- 2020.acl-main.728
- Volume:
- Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
- Month:
- July
- Year:
- 2020
- Address:
- Online
- Editors:
- Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8182–8197
- URL:
- https://aclanthology.org/2020.acl-main.728/
- DOI:
- 10.18653/v1/2020.acl-main.728
- Cite (ACL):
- Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, and Verena Rieser. 2020. History for Visual Dialog: Do we really need it?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8182–8197, Online. Association for Computational Linguistics.
- Cite (Informal):
- History for Visual Dialog: Do we really need it? (Agarwal et al., ACL 2020)
- PDF:
- https://aclanthology.org/2020.acl-main.728.pdf
- Code
- shubhamagarwal92/visdial_conv + additional community code
- Data
- VisDial, VisPro
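The NDCG metric referenced in the abstract ranks 100 candidate answers against dense human relevance annotations. The sketch below is a minimal, illustrative Python implementation, not the paper's evaluation code: the function name, signature, and toy data are assumptions, and the official VisDial evaluation (see the code repository above) is the authoritative reference.

```python
import numpy as np

def ndcg(relevance, ranking):
    """NDCG over dense relevance annotations (illustrative sketch).

    relevance : ground-truth relevance score per candidate answer
    ranking   : candidate indices sorted by the model, best first
    """
    relevance = np.asarray(relevance, dtype=float)
    # k = number of candidates the annotators judged relevant (score > 0)
    k = int((relevance > 0).sum())
    # DCG of the model's top-k candidates, log-discounted by rank
    dcg = sum(relevance[idx] / np.log2(rank + 2)
              for rank, idx in enumerate(ranking[:k]))
    # Ideal DCG: same discounting over relevances sorted descending
    ideal = np.sort(relevance)[::-1][:k]
    idcg = sum(rel / np.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical data): 4 candidates, two judged relevant.
rel = [0.0, 1.0, 0.5, 0.0]
model_ranking = [1, 2, 0, 3]  # model puts candidate 1 first
print(f"NDCG: {ndcg(rel, model_ranking):.3f}")  # 1.000 (perfect ordering)
```

Because the metric only scores the top k positions and any answer with nonzero relevance earns credit, several distinct but plausible answers can score equally well, which is the property the abstract points to when noting that the metric encourages generic replies.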