Abstract
Phrase grounding aims to map textual phrases to their associated image regions, which can be a prerequisite for multimodal reasoning and can benefit tasks that require identifying objects based on language. With pre-trained vision-and-language models achieving impressive performance across tasks, it remains unclear if we can directly utilize their learned embeddings for phrase grounding without fine-tuning. To this end, we propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings and propose four fine-tuning objectives to improve the model's phrase grounding ability using image-caption data without any supervised grounding signals. Experiments on two representative datasets demonstrate the effectiveness of our objectives, outperforming baseline models in both weakly-supervised and supervised phrase grounding settings. In addition, we evaluate the aligned embeddings on several other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation generality.
- Anthology ID:
- 2021.emnlp-main.513
- Volume:
- Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2021
- Address:
- Online and Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6362–6371
- URL:
- https://aclanthology.org/2021.emnlp-main.513
- DOI:
- 10.18653/v1/2021.emnlp-main.513
- Cite (ACL):
- Zi-Yi Dou and Nanyun Peng. 2021. Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6362–6371, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding (Dou & Peng, EMNLP 2021)
- PDF:
- https://preview.aclanthology.org/emnlp-22-attachments/2021.emnlp-main.513.pdf
- Data
- MS COCO, RefCOCO, Visual Question Answering
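The pair-extraction idea described in the abstract — matching each phrase to an image region using pre-trained embeddings alone — can be illustrated with a minimal sketch. This is an illustrative assumption on our part, not the paper's exact procedure: it simply pairs each phrase with its most cosine-similar region, and the function name and inputs are hypothetical.

```python
import numpy as np

def extract_phrase_region_pairs(phrase_emb: np.ndarray, region_emb: np.ndarray) -> np.ndarray:
    """Match each phrase to its most similar region by cosine similarity.

    phrase_emb: (num_phrases, dim) phrase embeddings from a pre-trained model
    region_emb: (num_regions, dim) region embeddings from the same model
    Returns an array of region indices, one per phrase.
    """
    # L2-normalize so the dot product equals cosine similarity
    p = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    sim = p @ r.T              # (num_phrases, num_regions) similarity matrix
    return sim.argmax(axis=1)  # best-matching region index for each phrase
```

In a weakly-supervised setting like the one the abstract describes, pairs obtained this way (from image-caption data only) could then serve as pseudo-labels for fine-tuning objectives, rather than relying on annotated grounding signals.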