Abstract
Phrase grounding aims to map textual phrases to their associated image regions, which can be a prerequisite for multimodal reasoning and can benefit tasks that require identifying objects based on language. With pre-trained vision-and-language models achieving impressive performance across tasks, it remains unclear if we can directly utilize their learned embeddings for phrase grounding without fine-tuning. To this end, we propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings and propose four fine-tuning objectives to improve the model's phrase grounding ability using image-caption data without any supervised grounding signals. Experiments on two representative datasets demonstrate the effectiveness of our objectives, outperforming baseline models in both weakly-supervised and supervised phrase grounding settings. In addition, we evaluate the aligned embeddings on several other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation generality.
- Anthology ID:
- 2021.emnlp-main.513
- Volume:
- Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2021
- Address:
- Online and Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6362–6371
- URL:
- https://aclanthology.org/2021.emnlp-main.513
- DOI:
- 10.18653/v1/2021.emnlp-main.513
- Cite (ACL):
- Zi-Yi Dou and Nanyun Peng. 2021. Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6362–6371, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding (Dou & Peng, EMNLP 2021)
- PDF:
- https://preview.aclanthology.org/emnlp-22-attachments/2021.emnlp-main.513.pdf
- Data
- MS COCO, RefCOCO, Visual Question Answering
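The pair-extraction idea described in the abstract — matching each phrase to an image region using pre-trained embeddings alone — can be illustrated with a minimal sketch. This is an illustrative assumption on our part, not the paper's exact procedure: it simply pairs each phrase with its most cosine-similar region, and the function name and inputs are hypothetical.

```python
import numpy as np

def extract_phrase_region_pairs(phrase_emb: np.ndarray, region_emb: np.ndarray) -> np.ndarray:
    """Match each phrase to its most similar region by cosine similarity.

    phrase_emb: (num_phrases, dim) phrase embeddings from a pre-trained model
    region_emb: (num_regions, dim) region embeddings from the same model
    Returns an array of region indices, one per phrase.
    """
    # L2-normalize so the dot product equals cosine similarity
    p = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    sim = p @ r.T              # (num_phrases, num_regions) similarity matrix
    return sim.argmax(axis=1)  # best-matching region index for each phrase
```

In a weakly-supervised setting like the one the abstract describes, pairs obtained this way (from image-caption data only) could then serve as pseudo-labels for fine-tuning objectives, rather than relying on annotated grounding signals.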