waynechu1021/NAVIDA

NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

A lightweight VLN framework with inverse dynamics supervision for action-grounded visual dynamics


📰 News

| Date | News |
|------|------|
| 2026-03-17 | 🔥 Code Release: Checkpoints and full code are now available! |
| 2026-01-26 | 📄 Paper Release: Paper is available on arXiv! |

📖 Abstract

NaVIDA is a lightweight Vision-Language Navigation (VLN) framework that incorporates inverse dynamics supervision as an explicit objective to embed action-grounded visual dynamics into policy learning. We employ hierarchical probabilistic action chunking to organize trajectories into multi-step chunks, enabling more effective navigation in continuous environments.
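To make the idea of multi-step chunks concrete, here is a minimal sketch that splits a trajectory's action sequence into fixed-size chunks. The action names, the chunk size, and the `chunk_actions` helper are illustrative assumptions; NaVIDA's hierarchical probabilistic chunking is described in the paper and is not reproduced here.

```python
from typing import List, Sequence


def chunk_actions(actions: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a flat action sequence into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(actions[i:i + chunk_size]) for i in range(0, len(actions), chunk_size)]


# Hypothetical discrete navigation actions (the names are assumptions).
trajectory = ["FORWARD", "FORWARD", "TURN_LEFT", "FORWARD", "TURN_RIGHT", "STOP"]
chunks = chunk_actions(trajectory, chunk_size=4)
print(chunks)
```

With a chunk size of 4, the six actions above fall into one full chunk and one shorter trailing chunk.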

Key Features

  • 🎯 Inverse Dynamics Supervision: Explicitly learns action-grounded visual dynamics
  • 🔄 Hierarchical Action Chunking: Organizes trajectories into multi-step chunks for better long-horizon planning
  • 🚀 Lightweight Design: Efficient framework suitable for real-world deployment
  • 🏆 State-of-the-art Performance: Outperforms larger 7B and 8B baselines on the R2R and RxR Val-Unseen splits of VLN-CE

Figure 1: Overview of NaVIDA framework. The model leverages inverse dynamics supervision to learn action-aware visual representations.
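As a rough illustration of what inverse dynamics supervision asks of a model, the sketch below builds training triples of the form (frame before, frame after, actions taken in between). The model would be trained to predict the actions from the two frames. The `InverseDynamicsSample` type, the `build_samples` helper, the frame identifiers, and the fixed horizon are all assumptions made for this example, not the repository's actual data format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class InverseDynamicsSample:
    frame_before: str          # observation at the start of the span
    frame_after: str           # observation after the actions were executed
    actions: Tuple[str, ...]   # supervision target: the actions in between


def build_samples(
    frames: Sequence[str], actions: Sequence[str], horizon: int
) -> List[InverseDynamicsSample]:
    """Pair frames that are `horizon` actions apart with the actions between them.

    frames[i] is the observation before actions[i], so a trajectory has
    exactly one more frame than it has actions.
    """
    if len(frames) != len(actions) + 1:
        raise ValueError("expected len(frames) == len(actions) + 1")
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    samples = []
    for start in range(0, len(actions), horizon):
        end = min(start + horizon, len(actions))
        samples.append(
            InverseDynamicsSample(frames[start], frames[end], tuple(actions[start:end]))
        )
    return samples


# Hypothetical frame identifiers and action names.
samples = build_samples(
    frames=["f0", "f1", "f2", "f3"],
    actions=["FORWARD", "TURN_LEFT", "FORWARD"],
    horizon=2,
)
for sample in samples:
    print(sample)
```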


🛠 Getting Started

Prerequisites

Set Up the Environment

1. Create Conda Environment

conda create -n navida python=3.10
conda activate navida

2. Install habitat-sim v0.2.4

git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
pip install -r requirements.txt
python setup.py install --headless

3. Install habitat-lab v0.2.4

cd ..    # leave the habitat-sim directory first, if you are still inside it
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab          # install habitat_lab
pip install -e habitat-baselines    # install habitat_baselines
pip install dtw fastdtw gym

4. Install NaVIDA Dependencies

pip install peft trl==0.16.0 transformers==4.50.3 tensorboardx qwen_vl_utils deepspeed distilabel wandb==0.18.3
pip install numpy==1.24.0 numba==0.60.0 tqdm opencv-python 
pip install vllm==0.9.1 torch torchvision protobuf==3.20
pip install flash-attn --no-build-isolation --no-cache-dir
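After installation, a quick check that the pinned versions actually ended up in the environment can save debugging time later. The sketch below compares the exact pins from the commands above against what is installed, using the standard library's `importlib.metadata`. The `find_mismatches` helper is hypothetical, and `protobuf==3.20` is left out because pip may resolve it to a patch release such as `3.20.0`, which a plain string comparison would flag.

```python
from importlib import metadata
from typing import Callable, Dict, Mapping

# Exact pins copied from the install commands above.
PINNED = {
    "trl": "0.16.0",
    "transformers": "4.50.3",
    "wandb": "0.18.3",
    "numpy": "1.24.0",
    "numba": "0.60.0",
    "vllm": "0.9.1",
}


def find_mismatches(
    pinned: Mapping[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Return {package: found_version} for every package that does not match its pin."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {found}")
```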

📁 Data Preparation

Step 1: Scene Datasets

| Benchmark | Scene Dataset | Link |
|-----------|---------------|------|
| R2R, RxR, EnvDrop | MP3D | Official Page |
| ScaleVLN | HM3D | Official GitHub |

Place the datasets as follows:

data/
├── scene_datasets/
│   ├── mp3d/          # MP3D scenes for R2R/RxR/EnvDrop
│   └── hm3d/          # HM3D scenes for ScaleVLN (train split)
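A small check like the following can confirm that the layout above is in place before launching any scripts. It only tests for the two directories shown in the tree; the `missing_scene_dirs` helper and the assumption that it runs from the repository root are mine.

```python
from pathlib import Path
from typing import List

# Directories taken from the layout above, relative to the data/ folder.
EXPECTED_DIRS = ("scene_datasets/mp3d", "scene_datasets/hm3d")


def missing_scene_dirs(data_root: Path) -> List[str]:
    """Return the expected sub-directories that do not exist under data_root."""
    return [rel for rel in EXPECTED_DIRS if not (data_root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_scene_dirs(Path("data"))
    print("missing:", ", ".join(missing) if missing else "none")
```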

Step 2: Download VLN Episodes

| Dataset | Link | Remark |
|---------|------|--------|
| R2R VLN-CE Episodes | Google Drive | N/A |
| RxR VLN-CE Episodes | Google Drive | N/A |
| RxR VLN-CE Episodes | HuggingFace | action space aligned with R2R by StreamVLN |
| ScaleVLN Subset | HuggingFace | N/A |

Step 3: Data Processing (Training Only)

Figure 2: Data processing and training pipeline of NaVIDA.

# Convert data to unified format
./scripts/preprocess.sh

# Extract RGB frames
./scripts/extract_frame.sh

# Prepare training data
./scripts/prepare_training_data.sh

Note: You can also prepare your DAgger data in the same format.
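If you prefer to drive the three preprocessing scripts from Python, one possible wrapper runs them in order and stops at the first failure. This is only a convenience sketch: `run_steps` is a hypothetical helper, and it assumes the scripts are executable and invoked from the repository root without arguments, as in the commands above.

```python
import subprocess
from typing import Callable, List, Sequence

# Script paths copied from the commands above, in the order they are listed.
PREPROCESS_STEPS = (
    "./scripts/preprocess.sh",
    "./scripts/extract_frame.sh",
    "./scripts/prepare_training_data.sh",
)


def run_steps(
    steps: Sequence[str],
    run: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run each script in order and return the list of scripts that completed.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so later steps never run on top of a failed earlier step.
    """
    completed = []
    for script in steps:
        run([script], check=True)
        completed.append(script)
    return completed


# Usage, from the repository root:
#     run_steps(PREPROCESS_STEPS)
```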

🔥 Training

./scripts/train.sh

🧭 Evaluation

Option 1: Eval with HuggingFace Transformers

./scripts/eval.sh

Option 2: Eval with vLLM (Faster Inference)

Step 1: Launch vLLM Server

./scripts/start_vllm_server.sh

Step 2: Run Evaluation

./scripts/eval_vllm.sh
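Large models can take a while to load, so it may help to wait until the server responds before starting the evaluation in Step 2. The sketch below polls a URL until it returns HTTP 200. The address `http://localhost:8000/health` in the usage comment is an assumption based on vLLM's usual OpenAI-compatible server defaults; `start_vllm_server.sh` may configure a different host or port, so check the script first. Both helper names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(
    url: str,
    attempts: int = 60,
    delay: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False


# Usage (the URL is an assumption, see above):
#     if wait_until_ready("http://localhost:8000/health"):
#         print("server is up, start ./scripts/eval_vllm.sh")
```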

🏆 Checkpoints

| Model | Link |
|-------|------|
| NaVIDA-3B | HuggingFace |

📈 Results

NaVIDA achieves state-of-the-art performance on both the R2R and RxR benchmarks with only 3B parameters and a single RGB camera as input.

R2R Val-Unseen
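| Method | #Params | SR ↑ | SPL ↑ | NE ↓ | OS ↑ |
|--------|---------|------|-------|------|------|
| NaVILA | 8B | 54.0 | 49.0 | 5.22 | 62.5 |
| StreamVLN | 7B | 56.9 | 51.9 | 4.98 | 64.2 |
| NaVIDA (Ours) | 3B | 61.4 | 54.7 | 4.32 | 69.5 |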
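For readers unfamiliar with the column names, the sketch below computes the four metrics from per-episode distances using their standard definitions: success rate (SR), success weighted by path length (SPL), navigation error (NE), and oracle success (OS). The `Episode` fields, the `summarize` helper, and the 3 m success radius are assumptions for illustration; the repository's evaluation scripts rely on the Habitat measures and may differ in detail.

```python
from dataclasses import dataclass
from typing import Dict, Sequence

SUCCESS_RADIUS_M = 3.0  # assumed success threshold in metres


@dataclass
class Episode:
    final_dist: float       # distance from the stop position to the goal (m)
    min_dist: float         # closest distance to the goal anywhere on the path (m)
    path_length: float      # length of the path the agent actually travelled (m)
    shortest_length: float  # shortest-path length from start to goal (m), must be > 0


def summarize(episodes: Sequence[Episode]) -> Dict[str, float]:
    """Average SR, SPL and OS (in percent) and NE (in metres) over episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    success = [1.0 if e.final_dist < SUCCESS_RADIUS_M else 0.0 for e in episodes]
    # SPL: a success counts fully only if the agent took the shortest path.
    spl = [
        s * e.shortest_length / max(e.path_length, e.shortest_length)
        for s, e in zip(success, episodes)
    ]
    oracle = [1.0 if e.min_dist < SUCCESS_RADIUS_M else 0.0 for e in episodes]
    return {
        "SR": 100.0 * sum(success) / n,
        "SPL": 100.0 * sum(spl) / n,
        "NE": sum(e.final_dist for e in episodes) / n,
        "OS": 100.0 * sum(oracle) / n,
    }


# Two made-up episodes: one success with a detour, one failure that passed near the goal.
demo = summarize([
    Episode(final_dist=1.0, min_dist=0.5, path_length=12.0, shortest_length=10.0),
    Episode(final_dist=5.0, min_dist=2.0, path_length=8.0, shortest_length=10.0),
])
print(demo)
```

Higher is better for SR, SPL, and OS, while a lower NE means the agent stops closer to the goal.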

RxR Val-Unseen
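| Method | #Params | SR ↑ | SPL ↑ | nDTW ↑ | NE ↓ |
|--------|---------|------|-------|--------|------|
| NaVILA | 8B | 49.3 | 44.0 | 58.8 | 6.77 |
| StreamVLN | 7B | 52.9 | 46.0 | 61.9 | 6.22 |
| NaVIDA (Ours) | 3B | 57.4 | 49.6 | 67.0 | 5.23 |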
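The nDTW column measures how closely the agent's path follows the reference path. A minimal sketch of the usual definition, nDTW = exp(-DTW(R, Q) / (|R| · d_th)), is given below with an exact dynamic-programming DTW over 2D points. The 3 m threshold `d_th`, the 2D point representation, and the helper names are assumptions; the `fastdtw` package installed earlier provides a faster approximate DTW and is not used here so that the example stays dependency-free.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def dtw(reference: Sequence[Point], query: Sequence[Point]) -> float:
    """Exact dynamic time warping cost between two paths (Euclidean point distance)."""
    n, m = len(reference), len(query)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference: Sequence[Point], query: Sequence[Point], threshold: float = 3.0) -> float:
    """Normalized DTW in [0, 1]; 1.0 means the query path matches the reference exactly."""
    if not reference or not query:
        raise ValueError("both paths must be non-empty")
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


reference_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
agent_path = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)]
print(ndtw(reference_path, agent_path))
```

The table reports nDTW as a percentage, so a value from this function would be multiplied by 100 to be comparable.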

🔗 Citation

If you find our work helpful, please consider starring this repo ⭐ and citing:

@article{zhu2026navida,
  title={NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation},
  author={Zhu, Weiye and Zhang, Zekai and Wang, Xiangchen and Pan, Hewei and Wang, Teng and Geng, Tiantian and Xu, Rongtao and Zheng, Feng},
  journal={arXiv preprint arXiv:2601.18188},
  year={2026}
}

👏 Acknowledgements

We would like to thank the authors of the following projects for their great work:
