Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation

Wanrong Zhu, Xin Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu, William Yang Wang


Abstract
One of the most challenging topics in Natural Language Processing (NLP) is visually-grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is such a task where an agent follows natural language instructions and navigates in real-life urban environments. With the lack of human-annotated instructions that illustrate the intricate urban scenes, outdoor VLN remains a challenging task to solve. In this paper, we introduce a Multimodal Text Style Transfer (MTST) learning approach and leverage external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of the instructions generated by Google Maps API, then pre-train the navigator with the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic, and our MTST approach significantly outperforms the baseline models on the outdoor VLN task, improving task completion rate by 8.7% relatively on the test set.
Anthology ID:
2021.eacl-main.103
Volume:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month:
April
Year:
2021
Address:
Online
Venue:
EACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1207–1221
Language:
URL:
https://aclanthology.org/2021.eacl-main.103
DOI:
10.18653/v1/2021.eacl-main.103
Bibkey:
Cite (ACL):
Wanrong Zhu, Xin Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu, and William Yang Wang. 2021. Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1207–1221, Online. Association for Computational Linguistics.
Cite (Informal):
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation (Zhu et al., EACL 2021)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.eacl-main.103.pdf
Data
Touchdown Dataset