Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi


Abstract
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and previous actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization to the GSA-R2R, a benchmark with novel instruction styles and unseen environments.
Anthology ID:
2026.acl-long.595
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
13035–13065
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.595/
DOI:
Bibkey:
Cite (ACL):
Tianyi Ma, Yue Zhang, Zehao Wang, and Parisa Kordjamshidi. 2026. Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13035–13065, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents (Ma et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.595.pdf
Checklist:
 2026.acl-long.595.checklist.pdf