MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding
Qinxin Wang, Hao Tan, Sheng Shen, Michael Mahoney, Zhewei Yao
Abstract
Phrase localization is a task that studies the mapping from textual phrases to regions of an image. Given difficulties in annotating phrase-to-object datasets at scale, we develop a Multimodal Alignment Framework (MAF) to leverage more widely-available caption-image datasets, which can then be used as a form of weak supervision. We first present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations. By adopting a contrastive objective, our method uses information in caption-image pairs to boost the performance in weakly-supervised scenarios. Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods. With the help of the visually-aware language representations, we can also improve the previous best unsupervised result by 5.56%. We conduct ablation studies to show that both our novel model and our weakly-supervised strategies significantly contribute to our strong results.
- Anthology ID:
- 2020.emnlp-main.159
- Volume:
- Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2030–2038
- URL:
- https://aclanthology.org/2020.emnlp-main.159
- DOI:
- 10.18653/v1/2020.emnlp-main.159
- Cite (ACL):
- Qinxin Wang, Hao Tan, Sheng Shen, Michael Mahoney, and Zhewei Yao. 2020. MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2030–2038, Online. Association for Computational Linguistics.
- Cite (Informal):
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding (Wang et al., EMNLP 2020)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2020.emnlp-main.159.pdf
- Code:
- qinzzz/Multimodal-Alignment-Framework
- Data:
- COCO, Visual Genome
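
The abstract describes scoring phrase-object relevance and training with a contrastive objective over caption-image pairs. The following is a minimal sketch of that general idea, not the authors' implementation (the linked repository holds that). It assumes phrases and image regions are already embedded as plain vectors in a shared space, that relevance is cosine similarity, and that a caption-image score is the mean over phrases of the best-matching region. The function names `cosine`, `ground`, `caption_image_score`, and `contrastive_loss` are hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed relevance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ground(phrases, regions):
    """For each phrase vector, return the index of its most similar region."""
    return [
        max(range(len(regions)), key=lambda j: cosine(p, regions[j]))
        for p in phrases
    ]


def caption_image_score(phrases, regions):
    """Assumed caption-image score: mean over phrases of the best region match."""
    best = [max(cosine(p, r) for r in regions) for p in phrases]
    return sum(best) / len(best)


def contrastive_loss(captions, images):
    """Softmax cross-entropy where caption i should score highest with image i.

    captions[i] is a list of phrase vectors; images[i] is a list of region
    vectors. The other images in the batch act as negatives.
    """
    total = 0.0
    for i, phrases in enumerate(captions):
        scores = [caption_image_score(phrases, regions) for regions in images]
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(captions)
```

Only caption-image pairing is used as the training signal here, which is what makes the setup weakly supervised: no phrase-to-region annotation enters the loss, and `ground` simply reads off the best region per phrase at inference time.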