Silvio Savarese
2022
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
Anthony Meng Huat Tiong | Junnan Li | Boyang Li | Silvio Savarese | Steven C.H. Hoi
Findings of the Association for Computational Linguistics: EMNLP 2022
Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves an improvement of 9.1% on GQA over FewVLM with 740M PLM parameters.
2018
Translating Navigation Instructions in Natural Language to a High-Level Plan for Behavioral Robot Navigation
Xiaoxue Zang | Ashwini Pokle | Marynel Vázquez | Kevin Chen | Juan Carlos Niebles | Alvaro Soto | Silvio Savarese
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
We propose an end-to-end deep learning model for translating free-form natural language instructions to a high-level plan for behavioral robot navigation. We use attention models to connect information from both the user instructions and a topological representation of the environment. We evaluate our model’s performance on a new dataset containing 10,050 pairs of navigation instructions. Our model significantly outperforms baseline approaches. Furthermore, our results suggest that it is possible to leverage the environment map as a relevant knowledge base to facilitate the translation of free-form navigation instructions.
2013
Learning Hierarchical Linguistic Descriptions of Visual Datasets
Roni Mittelman | Min Sun | Benjamin Kuipers | Silvio Savarese
Proceedings of the Workshop on Vision and Natural Language Processing