A lightweight VLN framework with inverse dynamics supervision for action-grounded visual dynamics
| Date | News |
|---|---|
| 2026-03-17 | 🔥 Code Release: Checkpoints and full code are now available! |
| 2026-01-26 | 📄 Paper Release: Paper is available on arXiv! |
NaVIDA is a lightweight Vision-Language Navigation (VLN) framework that incorporates inverse dynamics supervision as an explicit objective to embed action-grounded visual dynamics into policy learning. We employ hierarchical probabilistic action chunking to organize trajectories into multi-step chunks, enabling more effective navigation in continuous environments.
- 🎯 Inverse Dynamics Supervision: Explicitly learns action-grounded visual dynamics
- 🔄 Hierarchical Action Chunking: Organizes trajectories into multi-step chunks for better long-horizon planning
- 🚀 Lightweight Design: Efficient framework suitable for real-world deployment
- 🏆 State-of-the-art Performance: Achieves competitive results on VLN-CE benchmarks
Figure 1: Overview of NaVIDA framework. The model leverages inverse dynamics supervision to learn action-aware visual representations.
- Python 3.10
- CUDA 11.8+
- Habitat-Sim v0.2.4
- Habitat-Lab v0.2.4
1. Create Conda Environment
conda create -n navida python=3.10
conda activate navida2. Install habitat-sim v0.2.4
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
pip install -r requirements.txt
python setup.py install --headless3. Install habitat-lab v0.2.4
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab # install habitat_lab
pip install -e habitat-baselines # install habitat_baselines
pip install dtw fastdtw gym4. Install NaVIDA Dependencies
pip install peft trl==0.16.0 transformers==4.50.3 tensorboardx qwen_vl_utils deepspeed distilabel wandb==0.18.3
pip install numpy==1.24.0 numba==0.60.0 tqdm opencv-python
pip install vllm==0.9.1 torch torchvision protobuf==3.20
pip install flash-attn --no-build-isolation --no-cache-dir| Benchmark | Scene Dataset | Link |
|---|---|---|
| R2R, RxR, EnvDrop | MP3D | Official Page |
| ScaleVLN | HM3D | Official GitHub |
Place the datasets as follows:
data/
├── scene_datasets/
│ ├── mp3d/ # MP3D scenes for R2R/RxR/EnvDrop
│ └── hm3d/ # HM3D scenes for ScaleVLN (train split)
| Dataset | Link | Remark |
|---|---|---|
| R2R VLN-CE Episodes | Google Drive | N/A |
| N/A | ||
| RxR VLN-CE Episodes | HuggingFace | action space aligned with R2R by StreamVLN |
| ScaleVLN Subset | HuggingFace | N/A |
Figure 2: Data processing and training pipeline of NaVIDA.
# Convert data to unified format
./scripts/preprocess.sh
# Extract RGB frames
./scripts/extract_frame.sh
# Prepare training data
./scripts/prepare_training_data.shNote: You can also prepare your DAgger data in the same format.
./scripts/train.sh./scripts/eval.shStep 1: Launch vLLM Server
./scripts/start_vllm_server.shStep 2: Run Evaluation
./scripts/eval_vllm.sh| Model | Link |
|---|---|
| NaVIDA-3B | HuggingFace |
NaVIDA achieves state-of-the-art performance on both R2R and RxR benchmarks with only 3B parameters, using only single RGB camera input.
R2R Val-Unseen
| Method | #Params | SR | SPL | NE | OS |
|---|---|---|---|---|---|
| NaVILA | 8B | 54.0 | 49.0 | 5.22 | 62.5 |
| StreamVLN | 7B | 56.9 | 51.9 | 4.98 | 64.2 |
| NaVIDA (Ours) | 3B | 61.4 | 54.7 | 4.32 | 69.5 |
RxR Val-Unseen
| Method | #Params | SR | SPL | nDTW | NE |
|---|---|---|---|---|---|
| NaVILA | 8B | 49.3 | 44.0 | 58.8 | 6.77 |
| StreamVLN | 7B | 52.9 | 46.0 | 61.9 | 6.22 |
| NaVIDA (Ours) | 3B | 57.4 | 49.6 | 67.0 | 5.23 |
If you find our work helpful, please consider starring this repo ⭐ and cite:
@article{zhu2026navida,
title={NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation},
author={Zhu, Weiye and Zhang, Zekai and Wang, Xiangchen and Pan, Hewei and Wang, Teng and Geng, Tiantian and Xu, Rongtao and Zheng, Feng},
journal={arXiv preprint arXiv:2601.18188},
year={2026}
}We would like to thank the authors of the following projects for their great works:

