Chaofeng Chen¹, Sensen Yang¹, Haoning Wu¹, Liang Liao¹, Zicheng Zhang³, Annan Wang¹, Wenxiu Sun², Qiong Yan², Weisi Lin¹

¹S-Lab, Nanyang Technological University, ²SenseTime Research, ³Shanghai Jiao Tong University
Q-Ground is a multimodal model for localizing and describing image quality distortions. This repository provides the released model weights, dataset download utilities, a Gradio demo, evaluation scripts, and multi-stage training scripts used for the released version.
- ✅ Release all stage model weights in the 🤗 Hugging Face Q-Ground Collection
- ✅ Release test codes
- ✅ Release training codes
- ✅ Release datasets in 🤗 Hugging Face QGround-100K
Create a Python environment first. Install a PyTorch build compatible with your CUDA setup before installing the project dependencies.
```shell
conda create -n qg python=3.10 -y
conda activate qg
# Install PyTorch and DeepSpeed separately if needed for your machine.
pip install -r requirements.txt
```
Launch the Gradio demo:
```shell
bash test_chat.sh
```

This starts chat_app.py, saves predicted masks to ./tmp_chat_masks, and exposes a local Gradio interface with example images from ./example_imgs.
The training pipeline expects datasets under ./dataset. The released dataset snapshot on Hugging Face contains the required data for training and evaluation, organized in a way that the training scripts can directly use after extraction.
Download the released dataset snapshot from Hugging Face into ./dataset:
```shell
python download_datasets.py
```

If you do not have direct access to Hugging Face, use the mirror endpoint before downloading:

```shell
export HF_ENDPOINT=https://hf-mirror.com
python download_datasets.py
```

Extract all supported archives under ./dataset:

```shell
python extract_archives.py --root ./dataset
```

If you want to inspect or customize label processing, see ./utils_label.py. The released download and extraction flow is intended to perform the required preparation automatically.
The released training pipeline uses four stages. Compared with the paper version, the released model adds a mask decoder pretraining stage before final Q-Ground finetuning.
| Stage | Script | Initialization | Purpose |
|---|---|---|---|
| 1 | ./train_stage1.sh | liuhaotian/llava-v1.5-7b | Align multiscale visual-language features |
| 2 | ./train_stage2.sh | Stage 1 checkpoint | Improve instruction following |
| 3 | ./train_stage3.sh | Stage 2 checkpoint | Pretrain the mask decoder with segmentation-heavy data |
| 4 | ./train_stage4.sh | Stage 3 checkpoint | Finetune on Q-Ground quality grounding data |
Run the stages in order with `bash train_stage[1-4].sh`.
Each stage script launches ./train_ds.py with DeepSpeed and saves a Hugging Face-style checkpoint to a local output directory (for example, ./qg_model_multiscale_stage1 to ./qg_model_multiscale_stage4).
- The scripts expect GPUs and a working DeepSpeed installation.
- The default dataset root is ./dataset/.
- The released Stage 3 mask decoder pretraining noticeably improves mask quality compared with the paper configuration.
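Because each stage initializes from the previous stage's checkpoint, the scripts must run sequentially. A tiny Python driver that enforces this ordering (a sketch, not part of the repository; `run_stages` is a hypothetical helper):

```python
import subprocess


def run_stages(scripts: list[str]) -> None:
    """Run each training-stage script in order, aborting on the first failure."""
    for script in scripts:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so a later stage never starts from a missing or broken checkpoint.
        subprocess.run(["bash", script], check=True)


# Intended usage, assuming the repo's stage scripts are present:
# run_stages([f"./train_stage{i}.sh" for i in range(1, 5)])
```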
Released checkpoints are available in the 🤗 Hugging Face Q-Ground Collection.
The default inference and evaluation scripts load `--version chaofengc/qg_model_multiscale`.
You can override `--version` in ./evaluate_model.py or adapt the shell scripts to test a local stage checkpoint.
Evaluate the released model on the Q-Ground benchmark with:
```shell
bash test_bench.sh
```

The script runs ./evaluate_model.py and stores outputs in ./tmp_qground_results.
Compared with the paper, the extra mask decoder pretraining stage (Stage 3) significantly improves mask prediction performance (mIoU), which is key to quality grounding. Benchmark results of the final Q-Ground model, compared with the paper version:
| Model | Jitter | Noise | Overexposure | Blur | Low light | Average |
|---|---|---|---|---|---|---|
| Paper mIoU | 0.434 | 0.051 | 0.125 | 0.460 | 0.219 | 0.271 |
| Released mIoU | 0.4201 | 0.1793 | 0.2756 | 0.4651 | 0.2971 | 0.3274 |
| Paper mAcc | 0.720 | 0.176 | 0.459 | 0.648 | 0.337 | 0.539 |
| Released mAcc | 0.7187 | 0.4138 | 0.5172 | 0.6098 | 0.4461 | 0.5411 |
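For reference, the per-category mIoU above is the standard intersection-over-union between predicted and ground-truth distortion masks, averaged over test images. A minimal NumPy sketch of the metric (illustrative only, not the repository's evaluation code):

```python
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention assumed here: two empty masks count as a perfect match.
    return float(inter / union) if union > 0 else 1.0


# Toy example: a 4x4 prediction overlapping the ground truth in 2 of 6 pixels.
pred = np.zeros((4, 4), dtype=bool); pred[0:2, 0:2] = True  # 4 predicted pixels
gt = np.zeros((4, 4), dtype=bool); gt[1:3, 0:2] = True      # 4 ground-truth pixels
print(iou(pred, gt))  # 2 / 6 ≈ 0.333
```

mIoU for a category is then the mean of this score over all test images in that category.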
If you find this work useful, please consider citing our paper:
```bibtex
@inproceedings{chen2024qground,
  title={Q-Ground: Image Quality Grounding with Large Multi-modality Models},
  author={Chaofeng Chen and Sensen Yang and Haoning Wu and Liang Liao and Zicheng Zhang and Annan Wang and Wenxiu Sun and Qiong Yan and Weisi Lin},
  booktitle={ACM International Conference on Multimedia},
  year={2024},
}
```

This project is based on PixelLM, LISA, and LLaVA. Thanks to the authors for their great work!
