SocioReasoner

arXiv Open In Colab

Official implementation of the ICLR 2026 paper "Urban Socio-Semantic Segmentation with Vision-Language Reasoning".

Abstract: This paper introduces the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's significant gains over state-of-the-art models and strong zero-shot generalization.

0. Demo

You can run the demo directly in Google Colab.

Demo figure. Left: Wangjing SOHO; right: Wuhan University.

1. Installation

  • OS: a Linux distribution with CUDA support
  • Hardware: at least 4x NVIDIA H20 (or A100 80GB) GPUs
  • Framework: this repository is built on ROLL; set up the environment with the commands below.
conda create -n socioseg python=3.10 -y
conda activate socioseg
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
pip install 'transformer-engine[pytorch]==2.2.0' deepspeed==0.16.4 vllm==0.8.4 --no-build-isolation
pip install -e .
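
After installation, a quick sanity check (a minimal sketch; it only touches packages installed above) confirms that PyTorch sees the GPUs and that flash-attn imports cleanly:

# Environment sanity check for the installation above.
import torch
import flash_attn  # raises ImportError if flash-attn did not build correctly

print("torch:", torch.__version__)                 # expect 2.6.0
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())  # at least 4 recommended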

2. Dataset and Pretrained Model

Hugging Face dataset: SocioSeg. Raw dataset files: Google Drive.

Pretrained model: SocioReasoner-3B on Hugging Face.
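
For a quick look at the released assets, the Hugging Face client libraries can fetch both. A minimal sketch, assuming the repo IDs AMAP-ML/SocioSeg and AMAP-ML/SocioReasoner-3B (hypothetical; substitute the actual IDs from the links above):

# Download the dataset and the pretrained checkpoint.
# NOTE: both repo IDs are assumptions; replace them with the real ones.
from datasets import load_dataset
from huggingface_hub import snapshot_download

dataset = load_dataset("AMAP-ML/SocioSeg")  # hypothetical dataset repo ID
print(dataset)

ckpt_dir = snapshot_download("AMAP-ML/SocioReasoner-3B")  # hypothetical model repo ID
print("checkpoint downloaded to:", ckpt_dir)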

3. Training

# defaults to the Hugging Face dataset
sh examples/train/train.sh

If you want to use the raw dataset files, set actor_train.data_args.file_name and validation.data_args.file_name in examples/train/rlvr_megatron.yaml to your local dataset paths; see the sketch below.
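
If you prefer to script that edit, here is a minimal PyYAML sketch. It assumes the dotted keys map to nested YAML mappings; the dataset paths are placeholders:

# Point the training config at local raw dataset files.
# Assumes nested YAML keys; paths below are placeholders.
import yaml

cfg_path = "examples/train/rlvr_megatron.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["actor_train"]["data_args"]["file_name"] = "/path/to/local/train_data"
cfg["validation"]["data_args"]["file_name"] = "/path/to/local/val_data"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)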

The trained model will be saved to ./output/train/checkpoint/.

4. Evaluation and Visualization

# defaults to the Hugging Face dataset and model
sh examples/infer/infer.sh

If you want to use the raw dataset files or your own trained model, change actor_train.data_args.file_name and pretrain in examples/infer/rlvr_megatron.yaml accordingly; see the sketch below.
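
The same PyYAML approach works here. A minimal sketch, again assuming nested keys and a top-level pretrain entry (both assumptions; paths are placeholders):

# Point the inference config at local data and your own checkpoint.
# Assumes nested YAML keys and a top-level `pretrain`; paths are placeholders.
import yaml

cfg_path = "examples/infer/rlvr_megatron.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["actor_train"]["data_args"]["file_name"] = "/path/to/local/eval_data"
cfg["pretrain"] = "./output/train/checkpoint/"  # or a Hugging Face model ID

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)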

The evaluation and visualization results will be saved to ./output/infer/result/.

Citation

@article{wang2026socioreasoner,
  title={Urban Socio-Semantic Segmentation with Vision-Language Reasoning}, 
  author={Yu Wang and Yi Wang and Rui Dai and Yujie Wang and Kaikui Liu and Xiangxiang Chu and Yansheng Li},
  journal={arXiv preprint arXiv:2601.10477},
  year={2026}
}

Acknowledgements

We thank the authors of ROLL and SegZero.
