Abstract
Weakly supervised phrase grounding aims to learn an alignment between phrases in a caption and objects in a corresponding image using only caption-image annotations, i.e., without phrase-object annotations. Previous methods typically use a caption-image contrastive loss to indirectly supervise the alignment between phrases and objects, which prevents full use of the intrinsic structure of the multimodal data and leads to unsatisfactory performance. In this work, we directly use a phrase-object contrastive loss, even though no positive phrase-object annotations are available in the first place. Specifically, we propose a novel contrastive learning framework based on the expectation-maximization algorithm that adaptively refines the target prediction. Experiments on two widely used benchmarks, Flickr30K Entities and RefCOCO+, demonstrate the effectiveness of our framework. We obtain 63.05% top-1 accuracy on Flickr30K Entities and 59.51%/43.46% on RefCOCO+ TestA/TestB, outperforming previous methods by a large margin and even surpassing a previous SoTA that uses a pre-trained vision-language model. Furthermore, we deliver a theoretical analysis of the effectiveness of our method from the perspective of maximum likelihood estimation with latent variables.
- Anthology ID:
- 2022.emnlp-main.586
- Volume:
- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8549–8559
- URL:
- https://aclanthology.org/2022.emnlp-main.586
- Cite (ACL):
- Keqin Chen, Richong Zhang, Samuel Mensah, and Yongyi Mao. 2022. Contrastive Learning with Expectation-Maximization for Weakly Supervised Phrase Grounding. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8549–8559, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- Contrastive Learning with Expectation-Maximization for Weakly Supervised Phrase Grounding (Chen et al., EMNLP 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.emnlp-main.586.pdf
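The abstract's closing claim, a theoretical analysis "from the perspective of maximum likelihood estimation with latent variables," refers to the standard EM setup. A generic sketch of that setup (textbook EM, not the paper's specific derivation) is:

```latex
% Marginal log-likelihood with a latent variable z
% (here, z would be the unobserved phrase-object alignment):
\log p_\theta(x) = \log \sum_z p_\theta(x, z)

% Evidence lower bound, valid for any distribution q(z):
\log p_\theta(x) \ge \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]

% E-step: set q(z) = p_\theta(z \mid x), i.e., refine the target prediction;
% M-step: maximize the expected complete-data log-likelihood over \theta.
```

In this reading, the paper's adaptive refinement of the target prediction plays the role of the E-step, while optimizing the phrase-object contrastive loss against those refined targets plays the role of the M-step.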