[WIP]feat: support colocated on-policy training#1035

Closed
HT-Yuan wants to merge 4 commits into inclusionAI:main from HT-Yuan:feat/support_colocated_on_policy

Conversation

@HT-Yuan
Contributor

@HT-Yuan HT-Yuan commented Mar 14, 2026

Description

Support colocated (GPU time-sharing) on-policy training mode, where the training engine and inference engine share the same set of GPUs by alternating between offloaded/onloaded states via torch_memory_saver.

In colocated mode, weights are transferred through a local disk path (typically /dev/shm) for fast in-memory synchronization. The lifecycle per training step is:

[Inference on GPU] → rollout → offload inference / onload training → train step + save weights to disk → save HF checkpoint + recover checkpoint → offload training / onload inference + load weights from disk → [Inference on GPU] → next rollout
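The alternating lifecycle above can be sketched as a small state machine. This is a toy model, not the PR's actual API; the class and method names mirror the `prepare_for_training()` / `prepare_for_inference()` transitions described below but are otherwise illustrative:

```python
from enum import Enum

class GpuOwner(Enum):
    INFERENCE = "inference"
    TRAINING = "training"

class ColocatedLifecycle:
    """Toy state machine mirroring the per-step GPU handoff described above."""

    def __init__(self):
        # Rollout starts with the inference engine onloaded.
        self.owner = GpuOwner.INFERENCE

    def prepare_for_training(self):
        # Idempotent: a no-op if training already owns the GPUs.
        if self.owner is GpuOwner.TRAINING:
            return
        # Real orchestrator: offload inference engine, onload trainer states.
        self.owner = GpuOwner.TRAINING

    def prepare_for_inference(self):
        if self.owner is GpuOwner.INFERENCE:
            return
        # Real orchestrator: offload trainer, onload inference engine,
        # then load the freshly saved weights from disk.
        self.owner = GpuOwner.INFERENCE

lc = ColocatedLifecycle()
lc.prepare_for_training()   # rollout done -> run train step
lc.prepare_for_training()   # idempotent, safe to call twice
lc.prepare_for_inference()  # weights saved -> next rollout
print(lc.owner.value)       # -> inference
```

Making both transitions idempotent keeps the trainer's control flow simple: it can request the state it needs at each step without tracking who currently owns the GPUs.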

Alternatives Considered

| Approach | Description | Pros | Cons |
| --- | --- | --- | --- |
| CUDA IPC | Share GPU tensors directly between training and inference processes via CUDA IPC handles. Used in early versions of Slime but since deprecated. | Zero-copy, fastest weight sync | Already deprecated in Slime; highly invasive; tight coupling between training and inference memory layouts |
| GLOO-based CPU transfer | Create a GLOO process group between training and inference processes and broadcast/send weights through CPU memory (GPU→CPU on the trainer side, CPU→GPU on the inference side). | No disk I/O; direct inter-process communication; GLOO is well supported in PyTorch | Requires a cross-process group between the training engine and the SGLang inference server, which runs in a separate process with its own init; double memory copy (GPU→CPU→GPU) with CPU memory pressure for large models; significant plumbing to bridge AReaL's distributed groups with SGLang's internal process model; higher code complexity and maintenance burden |
| Disk-based sync (chosen) | Leverage the existing save_to_disk + update_weights_from_disk path already used by the colocated engine. | Minimal code changes; reuses battle-tested paths; easy to reason about correctness; no new dependencies | Disk I/O overhead (acceptable for on-policy, where sync happens once per epoch) |

Why disk-based?

  1. Minimal invasiveness: The colocated engine already has prepare_for_inference
    which saves weights to disk and loads them into the inference engine. On-policy
    mode simply ensures this happens synchronously at the right time.
  2. Correctness confidence: Reusing existing, well-tested weight sync paths
    reduces the risk of subtle bugs (e.g., partial weight updates, memory leaks).
  3. Performance is acceptable: For on-policy training, the weight sync happens
    once per PPO epoch. The disk I/O cost (~seconds) is negligible compared to
    the full rollout + training step duration.
  4. Future optimization path: If disk I/O becomes a bottleneck (e.g., very
    large models or frequent syncs), we can later explore new mechanisms without changing the on-policy
    control flow.
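The disk-based handoff boils down to a save-then-load round trip through a shared local path. A minimal sketch, using pickle and a temp directory as stand-ins for the real safetensors files under /dev/shm (function names here are hypothetical, not the PR's API):

```python
import os
import pickle
import tempfile

def save_weights_to_disk(weights: dict, base_path: str) -> str:
    """Trainer side: dump weights to a shared local path (e.g. /dev/shm)."""
    os.makedirs(base_path, exist_ok=True)
    path = os.path.join(base_path, "weights.pkl")
    with open(path, "wb") as f:
        pickle.dump(weights, f)
    return path

def load_weights_from_disk(path: str) -> dict:
    """Inference side: read the freshly saved weights back in."""
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as shm:  # stand-in for /dev/shm
    p = save_weights_to_disk({"layer.weight": [0.1, 0.2]}, shm)
    restored = load_weights_from_disk(p)
    print(restored["layer.weight"])  # -> [0.1, 0.2]
```

Because /dev/shm is a tmpfs mount, the "disk" write is actually a memory copy, which is why the per-step sync cost stays in the seconds range.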

Key changes:

  • areal/infra/colocated.py (new): ColocatedOrchestrator and ColocatedConfig that manage GPU ownership switching between training and inference engines, including idempotent prepare_for_training() / prepare_for_inference() transitions and direct disk-based weight updates bypassing name_resolve coordination.
  • areal/api/cli_args.py: Added colocated and colocated_weight_path fields to PPOActorConfig with validation (must use weight_update_mode='disk').
  • areal/api/io_struct.py: Added WeightUpdateMeta.from_colocated_disk() factory method for creating ephemeral, local-disk-based weight update metadata.
  • areal/infra/__init__.py: Exported ColocatedConfig and ColocatedOrchestrator.
  • areal/infra/launcher/local.py: Detect colocated mode, roll back GPU counter so trainer reuses inference server GPUs, and inject TMS environment variables.
  • areal/trainer/rl_trainer.py: Integrated colocated orchestration into PPOTrainer — initialization, train/inference GPU switching, stats snapshot before GPU handoff, recover checkpoint handling, and safe teardown (onload actor before destroy to avoid TMS invalid free).
  • areal/utils/stats_tracker.py: When reduce_group=None (local-only export), force tensor creation on CPU to avoid unnecessary CUDA dependency.
  • examples/math/gsm8k_grpo_colocated.yaml (new): Example config for GSM8K GRPO with colocated training on 8 GPUs.
  • tests/test_colocated_engine.py (new): Unit tests for ColocatedOrchestrator lifecycle, WeightUpdateMeta.from_colocated_disk, and PPOTrainer colocated config validation.
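The from_colocated_disk() factory mentioned above follows a common pattern: a classmethod that fills in local-disk defaults so callers never touch name_resolve coordination. A self-contained toy analogue (the real class lives in areal/api/io_struct.py; these fields are illustrative, not its actual schema):

```python
from dataclasses import dataclass

@dataclass
class WeightUpdateMetaSketch:
    """Toy analogue of the WeightUpdateMeta factory described above."""
    type: str
    path: str
    use_lora: bool = False

    @classmethod
    def from_colocated_disk(
        cls,
        weight_path: str = "/dev/shm/areal_colocated_weights",
        use_lora: bool = False,
    ) -> "WeightUpdateMetaSketch":
        # Ephemeral, local-disk metadata: both engines share the node,
        # so no cross-node coordination service is needed.
        return cls(type="disk", path=weight_path, use_lora=use_lora)

meta = WeightUpdateMetaSketch.from_colocated_disk()
print(meta.path)  # -> /dev/shm/areal_colocated_weights
```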

Experiment Results

Verified on GSM8K GRPO with Qwen2.5-1.5B, single node 8×GPU, using python3 -m areal.infra.launcher.local examples/math/gsm8k_rl.py --config examples/math/gsm8k_grpo_colocated.yaml.

| Mode | Max task_reward |
| --- | --- |
| Standard | 0.79570 |
| Colocated (this PR) | 0.80180 |

Key observations:

  • Colocated mode achieves comparable/better eval reward.
  • End-to-end training completed successfully for 10 epochs with no divergence.
  • Weight sync overhead (timeperf/update_weights) averages ~3s per step.

Related Issue

Fixes #992

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature that enables GPU time-sharing for on-policy training, allowing both the training and inference components to operate on the same set of GPUs. This is achieved through a novel orchestration mechanism that manages the dynamic offloading and onloading of models, coupled with an efficient disk-based weight synchronization method. The change aims to optimize GPU resource utilization, particularly in single-node environments, by reducing the need for separate GPU allocations for training and inference, thereby improving overall efficiency and potentially reducing operational costs.

Highlights

  • Colocated Training Mode: Introduced a new colocated (GPU time-sharing) on-policy training mode, allowing the training and inference engines to share the same GPUs by alternating between offloaded/onloaded states via torch_memory_saver.
  • Weight Transfer Mechanism: Implemented a fast, local disk-based weight transfer mechanism (typically using /dev/shm) for in-memory synchronization between the training and inference engines in colocated mode.
  • Orchestration and Configuration: Added ColocatedOrchestrator and ColocatedConfig to manage GPU ownership switching, idempotent state transitions, and direct disk-based weight updates, bypassing name_resolve coordination.
  • Trainer Integration: Integrated the colocated orchestration into PPOTrainer, including initialization, GPU switching during training/inference cycles, stats snapshotting, checkpoint recovery handling, and safe teardown procedures.
  • Validation and Examples: Included comprehensive validation for colocated mode configurations and provided a new example configuration (gsm8k_grpo_colocated.yaml) for GSM8K GRPO with colocated training, along with dedicated unit tests.


Changelog
  • areal/api/cli_args.py
    • Added colocated and colocated_weight_path fields to PPOActorConfig to enable and configure the new GPU time-sharing mode.
    • Implemented validation in PPOActorConfig to enforce weight_update_mode='disk' when colocated mode is enabled.
  • areal/api/io_struct.py
    • Added a new factory method WeightUpdateMeta.from_colocated_disk() for creating ephemeral, local-disk-based weight update metadata specifically for colocated mode.
  • areal/infra/__init__.py
    • Exported ColocatedConfig and ColocatedOrchestrator to make them accessible within the areal.infra package.
  • areal/infra/colocated.py
    • Added a new file defining ColocatedConfig for configuration parameters related to colocated training.
    • Added a new file defining ColocatedOrchestrator to manage the GPU time-sharing lifecycle between training and inference engines, including offload/onload operations and direct disk-based weight updates.
  • areal/infra/launcher/local.py
    • Modified local_main to detect colocated mode and ensure torch_memory_saver environment variables are injected.
    • Adjusted the GPU counter to allow the trainer to reuse inference server GPUs in colocated mode.
  • areal/trainer/rl_trainer.py
    • Imported nullcontext and torch_memory_saver for conditional memory management.
    • Refactored PPOTrainer initialization to conditionally set up actor, critic, ref, and teacher based on colocated mode.
    • Integrated ColocatedOrchestrator into PPOTrainer for managing GPU ownership and weight synchronization.
    • Updated checkpoint recovery logic to handle the offloaded state of the actor in colocated mode.
    • Modified the train loop to incorporate GPU switching logic (prepare_for_training, prepare_for_inference) and conditional pausing of rollout.
    • Adjusted weight saving logic to use actor.save directly in colocated mode for disk-based transfers.
    • Introduced methods _capture_train_stats_snapshot and _export_eval_stats_snapshot for managing statistics logging during GPU handoffs.
    • Ensured the actor is onloaded before destruction in close method for safe teardown in colocated mode.
    • Updated _init_rollout to explicitly allow LoRA in colocated mode.
    • Expanded _validate_cfg with specific checks for colocated mode, such as requiring SPMD, single-node, and disallowing online training, critic, ref, or teacher models.
  • areal/utils/stats_tracker.py
    • Modified _aggregate to force tensor creation on CPU when reduce_group is None for local-only statistics export.
  • examples/math/gsm8k_grpo_colocated.yaml
    • Added a new example configuration file for GSM8K GRPO demonstrating colocated training with shared GPUs and disk-based weight updates.
    • Configured enable_offload: true and actor.colocated: true.
    • Removed ref model and enabled enable_memory_saver for SGLang/vLLM in the example configuration.
  • tests/test_colocated_engine.py
    • Added a new file containing unit tests for ColocatedConfig and ColocatedOrchestrator functionality.
    • Included tests for WeightUpdateMeta.from_colocated_disk factory method.
    • Provided tests for PPOTrainer's validation rules specific to colocated mode.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: colocated on-policy training, allowing the training and inference engines to share GPUs. The implementation is well-structured, with a new ColocatedOrchestrator to manage the GPU ownership lifecycle. The changes are extensive, touching configuration, I/O structures, the local launcher, and the main PPOTrainer. The integration into PPOTrainer is particularly complex, handling initialization, state transitions during the training loop, and checkpoint recovery, but it appears to be handled correctly. The addition of a new example configuration and comprehensive unit tests is commendable. I have one high-severity finding regarding an unused and flawed public method in the new orchestrator. Otherwise, the changes look solid.

@HT-Yuan
Contributor Author

HT-Yuan commented Mar 17, 2026

@garrett4wade
Hello, sorry to bother you.

As mentioned in issue #992, this PR aims to add colocated on-policy training support for AReaL.
Could you kindly let me know what additional work would be needed to get this PR ready for merging? For example, documentation, more detailed unit tests, or any other improvements you suggest.

Thank you very much for your time and help. Have a great day!

Collaborator

@garrett4wade garrett4wade left a comment


Hi @HT-Yuan , thank you for the contribution. We appreciate your efforts, but the current implementation looks a little bit ad-hoc. We can do it in a more elegant way.

Comment on lines +1200 to +1216

```python
# Colocated (GPU time-sharing) mode
colocated: bool = field(
    default=False,
    metadata={
        "help": "Enable colocated mode where training and inference share the same GPUs. "
        "When enabled, training and inference alternate via offload/onload with "
        "weights transferred through a local disk path (e.g. /dev/shm)."
    },
)
colocated_weight_path: str = field(
    default="/dev/shm/areal_colocated_weights",
    metadata={
        "help": "Base path for temporary weight storage in colocated mode. "
        "Defaults to /dev/shm for fast in-memory transfer. "
        "Only effective when colocated=True."
    },
)
```
Collaborator


Whether the engines are colocated should be determined by scheduling_strategy:

```yaml
actor:
  ...

rollout:
  scheduling_strategy:
    type: collocation
    target: actor
```

This implies colocation.

In addition, the weight synchronization path is determined automatically from cluster.fileroot, so there is no need to add extra fields to the config.

Comment on lines +260 to +268

```python
@classmethod
def from_colocated_disk(
    cls,
    weight_path: str = "/dev/shm/areal_colocated_weights",
    use_lora: bool = False,
    lora_name: str = "",
    lora_int_id: int = 1,
    base_model_name: str = "",
) -> "WeightUpdateMeta":
```
Collaborator


IMO there's no difference from ordinary disk-based weight update?

Comment on lines +364 to +367

```diff
-if config.get("enable_offload", False):
+# Detect colocated mode: training and inference share the same GPUs.
+is_colocated = config.get("actor", {}).get("colocated", False)
+
+if is_colocated or config.get("enable_offload", False):
```
Collaborator


Launcher is not the recommended usage and will be deprecated in the future. The current launch process is training script (trainer, local python file) -> scheduler submitting remote processes -> remote worker process (areal/infra/rpc/rpc_server.py).

To enable colocation, we should modify the arguments of scheduler.create_workers instead. It has also been implemented. Use proper scheduling_strategy for the job can enable colocation.

Comment on lines +482 to +495
tms_ctx: Any | nullcontext[None] = (
torch_memory_saver.disable() if self._colocated else nullcontext()
)
with tms_ctx:
rollout_batch = self.actor.prepare_batch(
self.train_dataloader,
workflow=workflow,
workflow_kwargs=workflow_kwargs,
should_accept_fn=dynamic_filter_fn,
group_size=config.gconfig.n_samples,
dynamic_bs=self.config.dynamic_bs,
)
if self._colocated:
self.rollout.pause()
Collaborator


self.actor and self.rollout (which would be FSDPEngine and RemoteSGLangEngine, for example) already implement the offload method. You can call self.actor.offload() to offload its parameters. What we should do is add a proper context as an additional engine method (and merge it with the timing context above to avoid extra indentation). This would be more elegant.
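The merged context the reviewer suggests could look like the following sketch, built on contextlib.ExitStack so timing and memory management collapse into one `with` block. All names here are illustrative, and nullcontext() stands in for torch_memory_saver.disable() so the sketch stays self-contained:

```python
import time
from contextlib import ExitStack, contextmanager, nullcontext

@contextmanager
def _timed(key: str, sink: dict):
    # Timing context: records elapsed seconds into `sink` on exit.
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[key] = time.perf_counter() - start

@contextmanager
def rollout_memory_context(colocated: bool, stats: dict):
    """Merge the timing context with the optional memory-release context
    so the trainer needs a single `with` block (names are illustrative)."""
    with ExitStack() as stack:
        stack.enter_context(_timed("rollout", stats))
        if colocated:
            # Real engine would enter torch_memory_saver.disable() here;
            # nullcontext() keeps this sketch runnable without torch.
            stack.enter_context(nullcontext())
        yield

stats = {}
with rollout_memory_context(colocated=True, stats=stats):
    pass  # prepare_batch(...) would run here
print(sorted(stats))  # -> ['rollout']
```

Exposing this as an engine method would let the trainer stay agnostic to whether the engine is colocated, which matches the reviewer's point about avoiding extra indentation in the train loop.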

@garrett4wade
Collaborator

I just provided some style suggestions. It would also be great if you could self-review the PR with the /review-pr command, which may surface details missed by human reviewers.

@HT-Yuan
Contributor Author

HT-Yuan commented Mar 17, 2026

@garrett4wade Thank you very much for your advice. I will use it to improve my implementation and hope to make AReaL even better. Wishing you a pleasant workday.

@HT-Yuan HT-Yuan changed the title feat: support colocated on-policy training [WIP]feat: support colocated on-policy training Mar 17, 2026
@HT-Yuan HT-Yuan marked this pull request as draft March 17, 2026 07:22
@HT-Yuan HT-Yuan force-pushed the feat/support_colocated_on_policy branch from 12a47d9 to f5536f0 Compare March 17, 2026 18:38
@HT-Yuan
Contributor Author

HT-Yuan commented Mar 19, 2026

@garrett4wade I have seen the implementation and discussion in #999, and I believe update_weight_for_tensor is the better choice.
This PR explored a GPU time-sharing approach based on disk-backed weight offload/onload orchestration for colocated on-policy training. Since it takes a different direction from the preferred implementation, I will close this PR here.
Thanks for the feedback.
Best regards.

@HT-Yuan HT-Yuan closed this Mar 19, 2026


Development

Successfully merging this pull request may close these issues.

[Question] Is GPU colocation between rollout and actor supported now? The code contains a colocation path, but it doesn't seem to cover this scenario
