This is the official implementation of "Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning".
Paper: Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Overview of the Scaf-GRPO framework:
Scaf-GRPO is a progressive training framework designed to overcome the "learning cliff" in reinforcement learning for LLMs. When a model consistently fails on difficult problems, leading to zero-reward signals and stalled progress, Scaf-GRPO intervenes with minimal, hierarchical guidance. By injecting tiered in-prompt hints—from abstract concepts to concrete steps—it enables the model to construct a valid solution, restoring the learning gradient and unlocking its ability to solve problems previously beyond its reach. This on-policy scaffolding approach preserves the model's exploratory autonomy while effectively extending the frontier of its reasoning capabilities.
Our implementation is built upon the excellent verl framework, specifically version 0.4.1.dev. Our installation process is therefore heavily based on their official guide.
You can either refer to the complete VERL Installation Guide for more details or follow the steps below for a tested and verified setup for Scaf-GRPO.
We recommend using Conda to manage the environment.
conda create --name scaf-grpo python=3.10
conda activate scaf-grpo
Install the PyTorch version compatible with your CUDA environment. We used CUDA 12.4.
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
VERL provides a convenient script to install its core dependencies. Run the following command from the root of this repository:
USE_MEGATRON=0 USE_SGLANG=0 bash scripts/install_vllm_sglang_mcore.sh
Note: We disable SGLANG and MEGATRON as they are not required for our Scaf-GRPO implementation.
Finally, install our project in editable mode along with the remaining requirements.
# Install this project in editable mode (without dependencies, as we handle them separately)
pip install --no-deps -e .
# Install the remaining Python packages
pip install -r requirements.txt
# Install the symeval library for mathematical evaluation
pip install "git+https://github.com/tongyx361/symeval.git"After these steps, your environment should be fully configured to run Scaf-GRPO.
Our training pipeline is managed through shell scripts. The main entry point is train.sh.
Our training data, including pre-generated hierarchical hints, is publicly available on Hugging Face at hkuzxc/scaf-grpo-dataset.
Run the following command from the root of the Scaf-GRPO repository to clone the dataset into the correct directory (data/DeepScaleR):
git clone https://huggingface.co/datasets/hkuzxc/scaf-grpo-dataset data/DeepScaleR
After cloning, the directory layout should look like this:
Scaf-GRPO/
└── data/
└── DeepScaleR/
├── DeepSeek-R1-Distill-Qwen-1.5B.parquet
├── Llama-3.2-3B-Instruct.parquet
├── Qwen2.5-7B.parquet
├── Qwen2.5-Math-1.5B.parquet
├── Qwen2.5-Math-7B.parquet
├── README.md
└── train.parquet
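Note that cloning a Hugging Face dataset with git requires git-lfs; if the parquet files come down as small pointer files, the Hugging Face CLI is an alternative way to fetch the data (a suggested workaround, not part of the original setup):
# Alternative download using the Hugging Face CLI (requires huggingface_hub)
huggingface-cli download hkuzxc/scaf-grpo-dataset --repo-type dataset --local-dir data/DeepScaleR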
All training configurations are centralized in shell scripts located in the sh/ directory. For a standard Scaf-GRPO run, you will need to edit sh/hint_mix_grpo/bs256_6k_mix.sh.
Open this file and modify the following key variables:
PROJECT_NAME&EXP_NAME: Set your Weights & Biases project and experiment names for logging.MODEL_PATH: Specify the path to the base model you want to train. This can be a Hugging Face model identifier (e.g.,"meta-llama/Llama-2-7b-hf") or a path to a local checkpoint.data_train_path: Set the path to your training data. After downloading, this should be"data/DeepScaleR/train.parquet".
Example configuration in sh/hint_mix_grpo/bs256_6k_mix.sh:
PROJECT_NAME='scaf-grpo-project'
EXP_NAME='Qwen2.5-math-7b-scaf-grpo'
MODEL_PATH="Qwen/Qwen2.5-Math-7B"
data_train_path="data/DeepScaleR/Qwen2.5-Math-7B.parquet"
You can also adjust other hyperparameters, such as the learning rate (lr) and batch size (train_batchsize), in the same script.
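For instance, these settings might look like the following in the script (the values shown are illustrative placeholders, not the defaults we used):
# Illustrative values only; tune for your model and hardware
lr=1e-6
train_batchsize=256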
⚠️ Important Note for Qwen2.5-Math Models
The default configuration of the Qwen2.5-Math series models has a max_position_embeddings of 4096. Our training setup requires handling a combined length of data.max_prompt_length=4096 and data.max_response_length=2048, which exceeds this limit, especially when hints are added to the prompt.
To accommodate these longer sequences, we have manually modified the config.json of these models. If you are using a fresh Qwen2.5-Math model, you will need to apply these changes to its config.json file as well:
{
  "sliding_window": null,
  "rope_theta": 15000,
  "max_position_embeddings": 6144
}
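If you prefer to apply the change programmatically, a minimal sketch (assuming MODEL_DIR points to your local copy of the model) is:
# Patch config.json in place; MODEL_DIR is a placeholder for your local model directory
MODEL_DIR=/path/to/Qwen2.5-Math-7B
python - "$MODEL_DIR/config.json" <<'EOF'
import json, sys
path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
cfg.update({"sliding_window": None, "rope_theta": 15000, "max_position_embeddings": 6144})
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF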
The main train.sh script is a convenient wrapper that sets up the environment and calls the specific training script. Before running, ensure your WANDB_API_KEY is set in this file.
# In train.sh
export WANDB_API_KEY="YOUR_WANDB_API_KEY"
...
Once configured, simply run train.sh to start the training:
bash train.sh
This will activate the conda environment and execute the Scaf-GRPO training script (sh/hint_mix_grpo/bs256_6k_mix.sh) with an 8-GPU setup.
We also provide a script for the standard GRPO baseline. To run it, you need to make two changes:
- Configure the baseline script: Open sh/baseline/grpo/bs256_6k.sh and set MODEL_PATH, data_train_path, etc., just as you did for the Scaf-GRPO script.
- Modify train.sh: Change the script being executed from the Scaf-GRPO one to the baseline one.
# In train.sh, change this line:
# from:
bash sh/hint_mix_grpo/bs256_6k_mix.sh
# to:
bash sh/baseline/grpo/bs256_6k.sh
Then, run bash train.sh as before to start the baseline experiment.
Our evaluation process is handled by a single, comprehensive script that first generates responses from a trained model and then evaluates them.
To evaluate a checkpoint, you need to configure the main evaluation script: sh/generation_eval/do_sample.sh.
Open this file and modify the following key variables at the top:
- model_path: Path to the trained checkpoint directory you want to evaluate.
- data_path: Path to the specific benchmark dataset file (e.g., data/MATH-500/.../test.parquet).
- save_root_dir: The base directory where all evaluation outputs and results will be saved.
You can also adjust generation parameters like n_samples, temperature, and top_p within this script.
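For reference, the variable block at the top of sh/generation_eval/do_sample.sh might look roughly like this (all paths and sampling values below are illustrative placeholders, not the script's actual defaults):
# Illustrative placeholders; point these at your own checkpoint, benchmark file, and output directory
model_path=/path/to/your/trained/checkpoint
data_path=/path/to/benchmark/test.parquet
save_root_dir=/path/to/eval_outputs
# Example generation settings
n_samples=16
temperature=0.6
top_p=0.95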
Once the script is configured, simply execute it. The script will handle both the response generation and the subsequent evaluation automatically.
bash sh/generation_eval/do_sample.sh
The script first calls verl.trainer.main_generation to produce model outputs and saves them to a file. It then immediately calls verl.trainer.main_eval to process that output file and compute the final scores.
The final results, including generated responses and their evaluation metrics, will be stored in the save_path constructed within the script (e.g., ${save_root_dir}/.../generation_output.parquet).
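To take a quick look at the saved generations, something like the following works (the parquet column names depend on the pipeline's output schema, so print them first; the file path is a placeholder for the save_path reported by the script):
# Inspect the evaluation output; replace the path with the actual save_path
python -c "import pandas as pd; df = pd.read_parquet('path/to/generation_output.parquet'); print(df.columns.tolist()); print(df.head())"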
If you find our work and resources useful in your research, please consider citing our paper:
@article{zhang2025scafgrpo,
title={{Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning}},
author={Xichen Zhang and Sitong Wu and Yinghao Zhu and Haoru Tan and Shaozuo Yu and Ziyi He and Jiaya Jia},
journal={arXiv preprint arXiv:2510.19807},
year={2025}
}
We would like to thank the following projects for their great work and inspiration:
