"M⁴-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection"
by Jiyuan Liu, Jia Lin, Xiaofei Zhou*, Runmin Cong, Deyang Liu, Zhi Liu
CVPR 2026 Accepted!
Paper (arXiv) | CVPR Open Access | Code (GitHub)
If you find this work helpful, please consider giving us a star!
We propose M⁴-SAM, a prompt-free framework that adapts SAM2 for RGB-D video salient object detection by introducing modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization.
Key Highlights:
- Modality-Aware MoE-LoRA: elevates vanilla LoRA with convolutional experts and modality-specific routing for adaptive RGB-D feature fusion and efficient fine-tuning.
- Gated Multi-Level Feature Fusion: hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism to balance spatial details and semantic context.
- Pseudo-Guided Initialization: bootstraps the memory bank using a coarse mask as a pseudo prior, enabling zero-shot VSOD without manual prompts.
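For readers new to the terms in the first highlight, the generic textbook forms of LoRA and of a mixture of LoRA experts are sketched below. This is background only, not the exact M⁴-SAM design: the symbols $W_0$, $A$, $B$, $g_i$, $N$, $r$, and $\alpha$ are notation assumed here, and the convolutional experts and modality-specific routing described above are not reproduced.

```latex
% Vanilla LoRA: frozen weight W_0 plus a trainable low-rank update
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Generic mixture of N LoRA experts, weighted by a learned router g
h = W_0 x + \frac{\alpha}{r} \sum_{i=1}^{N} g_i(x)\, B_i A_i x,
\qquad \sum_{i=1}^{N} g_i(x) = 1
```

Only the low-rank factors and the router are trained in this generic form, which is what keeps the fine-tuning parameter-efficient.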
OS/Hardware Compatibility Note:
This codebase was developed and tested exclusively on Ubuntu/Linux. We strongly recommend using a Linux environment.
Please note that slight performance variations may occur due to differences in OS versions, GPU models, and CUDA drivers. We appreciate your understanding.
# Enter the codebase directory
cd M4SAM_Code
# Create and activate conda environment
conda create -n m4sam python==3.10
conda activate m4sam
# Install PyTorch with CUDA support
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
# Install other dependencies
pip install -r requirements.txt
The commands below download sam2.1_hiera_large.pt from Meta AI.
cd checkpoints
bash download_sam_ckpt.sh
cd ..
We provide our model checkpoints to help you easily reproduce the performance metrics reported in our paper.
| Dataset | Source Repo | Checkpoint |
|---|---|---|
| RDVS | https://github.com/kerenfu/RDVS | M4SAM-rdvs.pth |
| ViDSOD-100 | https://github.com/jhl-Det/RGBD_Video_SOD | M4SAM-vidsod.pth |
| DViSal | https://github.com/DVSOD/DVSOD-DViSal | M4SAM-dvisal.pth |
Dataset Path Note: Please download the datasets from their official sources linked in the table above. Extract them into a single parent directory (e.g., /data). Your folder structure should look like this:
/data/
├── DViSal_dataset/
│   ├── data/
│   └── test_all.txt
├── RDVS/
│   ├── test/
│   └── train/
└── VidSOD/
    ├── test/
    └── train/
When running evaluation or training, the --test_image_path / --train_image_path argument should point to this parent directory (e.g., /data).
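Before launching inference or training, it can help to confirm that the parent directory matches the folder structure described in the note above. The following is a minimal sketch of such a pre-flight check. The function name `check_dataset_layout` is hypothetical and not part of the M4SAM codebase, and the expected paths are taken only from the structure listed in this note.

```shell
# Hypothetical helper, not shipped with the repository.
# Usage: check_dataset_layout PARENT_DIR DATASET
#   DATASET is one of: rdvs, vidsod, dvisal
# Returns 0 if all expected paths exist, 1 if any is missing,
# and 2 if the dataset name is not recognized.
check_dataset_layout() {
    cl_parent="$1"
    cl_dataset="$2"
    case "$cl_dataset" in
        dvisal) cl_expected="DViSal_dataset/data DViSal_dataset/test_all.txt" ;;
        rdvs)   cl_expected="RDVS/test RDVS/train" ;;
        vidsod) cl_expected="VidSOD/test VidSOD/train" ;;
        *)
            echo "unknown dataset: $cl_dataset" >&2
            return 2
            ;;
    esac
    cl_status=0
    # Word splitting on $cl_expected is intentional: one path per word.
    for cl_rel in $cl_expected; do
        if [ ! -e "$cl_parent/$cl_rel" ]; then
            echo "missing: $cl_parent/$cl_rel" >&2
            cl_status=1
        fi
    done
    return "$cl_status"
}
```

For example, `check_dataset_layout /data rdvs && echo "layout OK"` reports success only when both RDVS/test and RDVS/train exist under /data. Missing paths are listed on standard error.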
Place the downloaded checkpoints under the checkpoints/ directory:
checkpoints/
├── sam2.1_hiera_large.pt   # SAM2 (downloaded by download_sam_ckpt.sh)
├── M4SAM-dvisal.pth        # DViSal
├── M4SAM-rdvs.pth          # RDVS
└── M4SAM-vidsod.pth        # ViDSOD-100
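A truncated or missing download is easy to overlook, so a quick check of the checkpoints directory can save a failed run. The sketch below uses a hypothetical helper, `check_checkpoints`, that is not part of the M4SAM codebase. It assumes that both the SAM2 checkpoint and the dataset-specific checkpoint are needed, which this README implies but does not state outright.

```shell
# Hypothetical helper, not shipped with the repository.
# Usage: check_checkpoints CHECKPOINT_DIR DATASET
#   DATASET is one of: rdvs, vidsod, dvisal
# Returns 0 if both files exist and are non-empty, 1 otherwise.
check_checkpoints() {
    ck_dir="$1"
    ck_dataset="$2"
    ck_status=0
    for ck_file in "sam2.1_hiera_large.pt" "M4SAM-${ck_dataset}.pth"; do
        # -s is true only for files that exist and have a size above zero,
        # so an empty file left by a failed download is also reported.
        if [ ! -s "$ck_dir/$ck_file" ]; then
            echo "missing or empty: $ck_dir/$ck_file" >&2
            ck_status=1
        fi
    done
    return "$ck_status"
}
```

A typical call would be `check_checkpoints checkpoints rdvs || exit 1` near the top of a run script, so that the script stops before any model code is loaded.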
You can run both inference and evaluation using the following parameterized bash script.
#!/bin/bash
# A quick script to run inference and evaluation
vid_len=16
device=0
dataset="rdvs" # Options: "rdvs", "vidsod", "dvisal"
data_path="/data" # Update this to your local data parent directory
output_dir="./results/${dataset}_pred"
# Set ground truth path based on dataset
if [ "$dataset" = "dvisal" ]; then
    gt_path="${data_path}/DViSal_dataset/data"
elif [ "$dataset" = "rdvs" ]; then
    gt_path="${data_path}/RDVS/test"
elif [ "$dataset" = "vidsod" ]; then
    gt_path="${data_path}/VidSOD/test"
else
    # Stop early instead of running with an unset gt_path
    echo "Unknown dataset: $dataset" >&2
    exit 1
fi
echo "Step 1: Running inference..."
python test.py \
    --vid_len "$vid_len" \
    --device "$device" \
    --ckpt "checkpoints/M4SAM-${dataset}.pth" \
    --test_image_path "$data_path" \
    --dataset "$dataset" \
    --save_path "$output_dir" \
    --save 1
echo "Step 2: Evaluating..."
python eval_tool.py \
    --dataset "$dataset" \
    --pred_path "$output_dir" \
    --gt_path "$gt_path"
Training uses PyTorch DDP for distributed multi-GPU training.
#!/bin/bash
dataset="rdvs" # Options: "rdvs", "vidsod", "dvisal"
data_path="/data" # Update this to your local data parent directory
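# Set epoch based on dataset
if [ "$dataset" = "dvisal" ]; then
    epoch=50
elif [ "$dataset" = "rdvs" ]; then
    epoch=60
elif [ "$dataset" = "vidsod" ]; then
    epoch=30
else
    # Stop early instead of passing an empty --epoch value
    echo "Unknown dataset: $dataset" >&2
    exit 1
fi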
python train_ddp.py \
    --batch_size 4 \
    --device 0,1 \
    --epoch "$epoch" \
    --vid_len 4 \
    --conti 0 \
    --lr 0.001 \
    --sync_bn 1 \
    --dataset "$dataset" \
    --train_image_path "$data_path"
Our work would not have been possible without the following open-source projects:
- SAM2
- XMem
- MemSAM
- SAM2-UNet
- PySODMetrics
Thanks for their great contributions!
If you find our work useful, please cite our paper. Thank you!
@InProceedings{Liu_2026_CVPR,
author = {Liu, Jiyuan and Lin, Jia and Zhou, Xiaofei and Cong, Runmin and Liu, Deyang and Liu, Zhi},
title = {M4-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
pages = {24970-24979}
}
This project is licensed under the CC BY-NC 4.0 License.