NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

A lightweight VLN framework with inverse dynamics supervision for action-grounded visual dynamics

📰 News

Date	News
2026-03-17	🔥 Code Release: Checkpoints and full code are now available!
2026-01-26	📄 Paper Release: Paper is available on arXiv!

📖 Abstract

NaVIDA is a lightweight Vision-Language Navigation (VLN) framework that incorporates inverse dynamics supervision as an explicit objective to embed action-grounded visual dynamics into policy learning. We employ hierarchical probabilistic action chunking to organize trajectories into multi-step chunks, enabling more effective navigation in continuous environments.

Key Features

🎯 Inverse Dynamics Supervision: Explicitly learns action-grounded visual dynamics
🔄 Hierarchical Action Chunking: Organizes trajectories into multi-step chunks for better long-horizon planning
🚀 Lightweight Design: Efficient framework suitable for real-world deployment
🏆 State-of-the-art Performance: Achieves competitive results on VLN-CE benchmarks

Figure 1: Overview of NaVIDA framework. The model leverages inverse dynamics supervision to learn action-aware visual representations.

🛠 Getting Started

Prerequisites

Python 3.10
CUDA 11.8+
Habitat-Sim v0.2.4
Habitat-Lab v0.2.4

Setup the Environment

1. Create Conda Environment

conda create -n navida python=3.10
conda activate navida

2. Install habitat-sim v0.2.4

git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
pip install -r requirements.txt
python setup.py install --headless

3. Install habitat-lab v0.2.4

git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab          # install habitat_lab
pip install -e habitat-baselines    # install habitat_baselines
pip install dtw fastdtw gym

4. Install NaVIDA Dependencies

pip install peft trl==0.16.0 transformers==4.50.3 tensorboardx qwen_vl_utils deepspeed distilabel wandb==0.18.3
pip install numpy==1.24.0 numba==0.60.0 tqdm opencv-python 
pip install vllm==0.9.1 torch torchvision protobuf==3.20
pip install flash-attn --no-build-isolation --no-cache-dir

📁 Data Preparation

Step 1: Scene Datasets

Benchmark	Scene Dataset	Link
R2R, RxR, EnvDrop	MP3D	Official Page
ScaleVLN	HM3D	Official GitHub

Place the datasets as follows:

data/
├── scene_datasets/
│   ├── mp3d/          # MP3D scenes for R2R/RxR/EnvDrop
│   └── hm3d/          # HM3D scenes for ScaleVLN (train split)

Step 2: Download VLN Episodes

Dataset	Link	Remark
R2R VLN-CE Episodes	Google Drive	N/A
~~RxR VLN-CE Episodes~~	~~Google Drive~~	N/A
RxR VLN-CE Episodes	HuggingFace	action space aligned with R2R by StreamVLN
ScaleVLN Subset	HuggingFace	N/A

Step 3: Data Processing (Training Only)

Figure 2: Data processing and training pipeline of NaVIDA.

# Convert data to unified format
./scripts/preprocess.sh

# Extract RGB frames
./scripts/extract_frame.sh

# Prepare training data
./scripts/prepare_training_data.sh

Note: You can also prepare your DAgger data in the same format.

🔥 Training

./scripts/train.sh

🧭 Evaluation

Option 1: Eval with HuggingFace Transformers

./scripts/eval.sh

Option 2: Eval with vLLM (Faster Inference)

Step 1: Launch vLLM Server

./scripts/start_vllm_server.sh

Step 2: Run Evaluation

./scripts/eval_vllm.sh

🏆 Checkpoints

Model	Link
NaVIDA-3B	HuggingFace

📈 Results

NaVIDA achieves state-of-the-art performance on both R2R and RxR benchmarks with only 3B parameters, using only single RGB camera input.

R2R Val-Unseen

Method	#Params	SR	SPL	NE	OS
NaVILA	8B	54.0	49.0	5.22	62.5
StreamVLN	7B	56.9	51.9	4.98	64.2
NaVIDA (Ours)	3B	61.4	54.7	4.32	69.5

RxR Val-Unseen

Method	#Params	SR	SPL	nDTW	NE
NaVILA	8B	49.3	44.0	58.8	6.77
StreamVLN	7B	52.9	46.0	61.9	6.22
NaVIDA (Ours)	3B	57.4	49.6	67.0	5.23

🔗 Citation

If you find our work helpful, please consider starring this repo ⭐ and cite:

@article{zhu2026navida,
  title={NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation},
  author={Zhu, Weiye and Zhang, Zekai and Wang, Xiangchen and Pan, Hewei and Wang, Teng and Geng, Tiantian and Xu, Rongtao and Zheng, Feng},
  journal={arXiv preprint arXiv:2601.18188},
  year={2026}
}

👏 Acknowledgements

We would like to thank the authors of the following projects for their great works:

NaVid - Video-based VLN framework
StreamVLN - Streaming VLN approach
Habitat - 3D simulation platform

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
asset		asset
config		config
habitat_extensions		habitat_extensions
scripts		scripts
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

📰 News

📖 Abstract

Key Features

🛠 Getting Started

Prerequisites

Setup the Environment

📁 Data Preparation

Step 1: Scene Datasets

Step 2: Download VLN Episodes

Step 3: Data Processing (Training Only)

🔥 Training

🧭 Evaluation

Option 1: Eval with HuggingFace Transformers

Option 2: Eval with vLLM (Faster Inference)

🏆 Checkpoints

📈 Results

🔗 Citation

👏 Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

📰 News

📖 Abstract

Key Features

🛠 Getting Started

Prerequisites

Setup the Environment

📁 Data Preparation

Step 1: Scene Datasets

Step 2: Download VLN Episodes

Step 3: Data Processing (Training Only)

🔥 Training

🧭 Evaluation

Option 1: Eval with HuggingFace Transformers

Option 2: Eval with vLLM (Faster Inference)

🏆 Checkpoints

📈 Results

🔗 Citation

👏 Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages