[WIP]feat: support colocated on-policy training#1035

Closed
HT-Yuan wants to merge 4 commits into inclusionAI:main from HT-Yuan:feat/support_colocated_on_policy

Conversation

@HT-Yuan
Contributor

@HT-Yuan HT-Yuan commented Mar 14, 2026

Description

Support colocated (GPU time-sharing) on-policy training mode, where the training engine and inference engine share the same set of GPUs by alternating between offloaded/onloaded states via torch_memory_saver.

In colocated mode, weights are transferred through a local disk path (typically /dev/shm) for fast in-memory synchronization. The lifecycle per training step is:

[Inference on GPU] → rollout → offload inference / onload training → train step + save weights to disk → save HF checkpoint + recover checkpoint → offload training / onload inference + load weights from disk → [Inference on GPU] → next rollout
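The alternating lifecycle above can be sketched as a small state machine. This is a toy model, not the PR's actual API; the class and method names mirror the `prepare_for_training()` / `prepare_for_inference()` transitions described below but are otherwise illustrative:

```python
from enum import Enum

class GpuOwner(Enum):
    INFERENCE = "inference"
    TRAINING = "training"

class ColocatedLifecycle:
    """Toy state machine mirroring the per-step GPU handoff described above."""

    def __init__(self):
        # Rollout starts with the inference engine onloaded.
        self.owner = GpuOwner.INFERENCE

    def prepare_for_training(self):
        # Idempotent: a no-op if training already owns the GPUs.
        if self.owner is GpuOwner.TRAINING:
            return
        # Real orchestrator: offload inference engine, onload trainer states.
        self.owner = GpuOwner.TRAINING

    def prepare_for_inference(self):
        if self.owner is GpuOwner.INFERENCE:
            return
        # Real orchestrator: offload trainer, onload inference engine,
        # then load the freshly saved weights from disk.
        self.owner = GpuOwner.INFERENCE

lc = ColocatedLifecycle()
lc.prepare_for_training()   # rollout done -> run train step
lc.prepare_for_training()   # idempotent, safe to call twice
lc.prepare_for_inference()  # weights saved -> next rollout
print(lc.owner.value)       # -> inference
```

Making both transitions idempotent keeps the trainer's control flow simple: it can request the state it needs at each step without tracking who currently owns the GPUs.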

Alternatives Considered

| Approach | Description | Pros | Cons |
| --- | --- | --- | --- |
| CUDA IPC | Share GPU tensors directly between training and inference processes via CUDA IPC handles. Used in early versions of Slime but since deprecated. | Zero-copy, fastest weight sync | Already deprecated in Slime; highly invasive; tight coupling between training and inference memory layouts |
| GLOO-based CPU transfer | Create a GLOO process group between training and inference processes and broadcast/send weights through CPU memory (GPU→CPU on the trainer side, CPU→GPU on the inference side). | No disk I/O; direct inter-process communication; GLOO is well supported in PyTorch | Requires a cross-process group between the training engine and the SGLang inference server, which runs in a separate process with its own init; double memory copy (GPU→CPU→GPU) with CPU memory pressure for large models; significant plumbing to bridge AReaL's distributed groups with SGLang's internal process model; higher code complexity and maintenance burden |
| Disk-based sync (chosen) | Leverage the existing save_to_disk + update_weights_from_disk path already used by the colocated engine. | Minimal code changes; reuses battle-tested paths; easy to reason about correctness; no new dependencies | Disk I/O overhead (acceptable for on-policy, where sync happens once per epoch) |

Why disk-based?

  1. Minimal invasiveness: The colocated engine already has prepare_for_inference
    which saves weights to disk and loads them into the inference engine. On-policy
    mode simply ensures this happens synchronously at the right time.
  2. Correctness confidence: Reusing existing, well-tested weight sync paths
    reduces the risk of subtle bugs (e.g., partial weight updates, memory leaks).
  3. Performance is acceptable: For on-policy training, the weight sync happens
    once per PPO epoch. The disk I/O cost (~seconds) is negligible compared to
    the full rollout + training step duration.
  4. Future optimization path: If disk I/O becomes a bottleneck (e.g., very
    large models or frequent syncs), we can later explore new mechanisms without changing the on-policy
    control flow.
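The disk-based handoff boils down to a save-then-load round trip through a shared local path. A minimal sketch, using pickle and a temp directory as stand-ins for the real safetensors files under /dev/shm (function names here are hypothetical, not the PR's API):

```python
import os
import pickle
import tempfile

def save_weights_to_disk(weights: dict, base_path: str) -> str:
    """Trainer side: dump weights to a shared local path (e.g. /dev/shm)."""
    os.makedirs(base_path, exist_ok=True)
    path = os.path.join(base_path, "weights.pkl")
    with open(path, "wb") as f:
        pickle.dump(weights, f)
    return path

def load_weights_from_disk(path: str) -> dict:
    """Inference side: read the freshly saved weights back in."""
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as shm:  # stand-in for /dev/shm
    p = save_weights_to_disk({"layer.weight": [0.1, 0.2]}, shm)
    restored = load_weights_from_disk(p)
    print(restored["layer.weight"])  # -> [0.1, 0.2]
```

Because /dev/shm is a tmpfs mount, the "disk" write is actually a memory copy, which is why the per-step sync cost stays in the seconds range.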

Key changes:

  • areal/infra/colocated.py (new): ColocatedOrchestrator and ColocatedConfig that manage GPU ownership switching between training and inference engines, including idempotent prepare_for_training() / prepare_for_inference() transitions and direct disk-based weight updates bypassing name_resolve coordination.
  • areal/api/cli_args.py: Added colocated and colocated_weight_path fields to PPOActorConfig with validation (must use weight_update_mode='disk').
  • areal/api/io_struct.py: Added WeightUpdateMeta.from_colocated_disk() factory method for creating ephemeral, local-disk-based weight update metadata.
  • areal/infra/__init__.py: Exported ColocatedConfig and ColocatedOrchestrator.
  • areal/infra/launcher/local.py: Detect colocated mode, roll back GPU counter so trainer reuses inference server GPUs, and inject TMS environment variables.
  • areal/trainer/rl_trainer.py: Integrated colocated orchestration into PPOTrainer — initialization, train/inference GPU switching, stats snapshot before GPU handoff, recover checkpoint handling, and safe teardown (onload actor before destroy to avoid TMS invalid free).
  • areal/utils/stats_tracker.py: When reduce_group=None (local-only export), force tensor creation on CPU to avoid unnecessary CUDA dependency.
  • examples/math/gsm8k_grpo_colocated.yaml (new): Example config for GSM8K GRPO with colocated training on 8 GPUs.
  • tests/test_colocated_engine.py (new): Unit tests for ColocatedOrchestrator lifecycle, WeightUpdateMeta.from_colocated_disk, and PPOTrainer colocated config validation.
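The from_colocated_disk() factory mentioned above follows a common pattern: a classmethod that fills in local-disk defaults so callers never touch name_resolve coordination. A self-contained toy analogue (the real class lives in areal/api/io_struct.py; these fields are illustrative, not its actual schema):

```python
from dataclasses import dataclass

@dataclass
class WeightUpdateMetaSketch:
    """Toy analogue of the WeightUpdateMeta factory described above."""
    type: str
    path: str
    use_lora: bool = False

    @classmethod
    def from_colocated_disk(
        cls,
        weight_path: str = "/dev/shm/areal_colocated_weights",
        use_lora: bool = False,
    ) -> "WeightUpdateMetaSketch":
        # Ephemeral, local-disk metadata: both engines share the node,
        # so no cross-node coordination service is needed.
        return cls(type="disk", path=weight_path, use_lora=use_lora)

meta = WeightUpdateMetaSketch.from_colocated_disk()
print(meta.path)  # -> /dev/shm/areal_colocated_weights
```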

Experiment Results

Verified on GSM8K GRPO with Qwen2.5-1.5B, single node 8×GPU, using python3 -m areal.infra.launcher.local examples/math/gsm8k_rl.py --config examples/math/gsm8k_grpo_colocated.yaml.

| Mode | Max task_reward |
| --- | --- |
| Standard | 0.79570 |
| Colocated (this PR) | 0.80180 |

Key observations:

  • Colocated mode achieves comparable/better eval reward.
  • End-to-end training completed successfully for 10 epochs with no divergence.
  • Weight sync overhead (timeperf/update_weights) averages ~3s per step.

Related Issue

Fixes #992

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature that enables GPU time-sharing for on-policy training, allowing both the training and inference components to operate on the same set of GPUs. This is achieved through a novel orchestration mechanism that manages the dynamic offloading and onloading of models, coupled with an efficient disk-based weight synchronization method. The change aims to optimize GPU resource utilization, particularly in single-node environments, by reducing the need for separate GPU allocations for training and inference, thereby improving overall efficiency and potentially reducing operational costs.

Highlights

  • Colocated Training Mode: Introduced a new colocated (GPU time-sharing) on-policy training mode, allowing the training and inference engines to share the same GPUs by alternating between offloaded/onloaded states via torch_memory_saver.
  • Weight Transfer Mechanism: Implemented a fast, local disk-based weight transfer mechanism (typically using /dev/shm) for in-memory synchronization between the training and inference engines in colocated mode.
  • Orchestration and Configuration: Added ColocatedOrchestrator and ColocatedConfig to manage GPU ownership switching, idempotent state transitions, and direct disk-based weight updates, bypassing name_resolve coordination.
  • Trainer Integration: Integrated the colocated orchestration into PPOTrainer, including initialization, GPU switching during training/inference cycles, stats snapshotting, checkpoint recovery handling, and safe teardown procedures.
  • Validation and Examples: Included comprehensive validation for colocated mode configurations and provided a new example configuration (gsm8k_grpo_colocated.yaml) for GSM8K GRPO with colocated training, along with dedicated unit tests.


Changelog
  • areal/api/cli_args.py
    • Added colocated and colocated_weight_path fields to PPOActorConfig to enable and configure the new GPU time-sharing mode.
    • Implemented validation in PPOActorConfig to enforce weight_update_mode='disk' when colocated mode is enabled.
  • areal/api/io_struct.py
    • Added a new factory method WeightUpdateMeta.from_colocated_disk() for creating ephemeral, local-disk-based weight update metadata specifically for colocated mode.
  • areal/infra/__init__.py
    • Exported ColocatedConfig and ColocatedOrchestrator to make them accessible within the areal.infra package.
  • areal/infra/colocated.py
    • Added a new file defining ColocatedConfig for configuration parameters related to colocated training.
    • Added a new file defining ColocatedOrchestrator to manage the GPU time-sharing lifecycle between training and inference engines, including offload/onload operations and direct disk-based weight updates.
  • areal/infra/launcher/local.py
    • Modified local_main to detect colocated mode and ensure torch_memory_saver environment variables are injected.
    • Adjusted the GPU counter to allow the trainer to reuse inference server GPUs in colocated mode.
  • areal/trainer/rl_trainer.py
    • Imported nullcontext and torch_memory_saver for conditional memory management.
    • Refactored PPOTrainer initialization to conditionally set up actor, critic, ref, and teacher based on colocated mode.
    • Integrated ColocatedOrchestrator into PPOTrainer for managing GPU ownership and weight synchronization.
    • Updated checkpoint recovery logic to handle the offloaded state of the actor in colocated mode.
    • Modified the train loop to incorporate GPU switching logic (prepare_for_training, prepare_for_inference) and conditional pausing of rollout.
    • Adjusted weight saving logic to use actor.save directly in colocated mode for disk-based transfers.
    • Introduced methods _capture_train_stats_snapshot and _export_eval_stats_snapshot for managing statistics logging during GPU handoffs.
    • Ensured the actor is onloaded before destruction in close method for safe teardown in colocated mode.
    • Updated _init_rollout to explicitly allow LoRA in colocated mode.
    • Expanded _validate_cfg with specific checks for colocated mode, such as requiring SPMD, single-node, and disallowing online training, critic, ref, or teacher models.
  • areal/utils/stats_tracker.py
    • Modified _aggregate to force tensor creation on CPU when reduce_group is None for local-only statistics export.
  • examples/math/gsm8k_grpo_colocated.yaml
    • Added a new example configuration file for GSM8K GRPO demonstrating colocated training with shared GPUs and disk-based weight updates.
    • Configured enable_offload: true and actor.colocated: true.
    • Removed ref model and enabled enable_memory_saver for SGLang/vLLM in the example configuration.
  • tests/test_colocated_engine.py
    • Added a new file containing unit tests for ColocatedConfig and ColocatedOrchestrator functionality.
    • Included tests for WeightUpdateMeta.from_colocated_disk factory method.
    • Provided tests for PPOTrainer's validation rules specific to colocated mode.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: colocated on-policy training, allowing the training and inference engines to share GPUs. The implementation is well-structured, with a new ColocatedOrchestrator to manage the GPU ownership lifecycle. The changes are extensive, touching configuration, I/O structures, the local launcher, and the main PPOTrainer. The integration into PPOTrainer is particularly complex, handling initialization, state transitions during the training loop, and checkpoint recovery, but it appears to be handled correctly. The addition of a new example configuration and comprehensive unit tests is commendable. I have one high-severity finding regarding an unused and flawed public method in the new orchestrator. Otherwise, the changes look solid.

@HT-Yuan
Contributor Author

HT-Yuan commented Mar 17, 2026

@garrett4wade
Hello, sorry to bother you.

As mentioned in issue #992, this PR aims to add colocated on-policy training support for AReaL.
Could you kindly let me know what additional work would be needed to get this PR ready for merging? For example, documentation, more detailed unit tests, or any other improvements you suggest.

Thank you very much for your time and help. Have a great day!

Collaborator

@garrett4wade garrett4wade left a comment


Hi @HT-Yuan , thank you for the contribution. We appreciate your efforts, but the current implementation looks a little bit ad-hoc. We can do it in a more elegant way.

Comment on lines +1200 to +1216

```python
# Colocated (GPU time-sharing) mode
colocated: bool = field(
    default=False,
    metadata={
        "help": "Enable colocated mode where training and inference share the same GPUs. "
        "When enabled, training and inference alternate via offload/onload with "
        "weights transferred through a local disk path (e.g. /dev/shm)."
    },
)
colocated_weight_path: str = field(
    default="/dev/shm/areal_colocated_weights",
    metadata={
        "help": "Base path for temporary weight storage in colocated mode. "
        "Defaults to /dev/shm for fast in-memory transfer. "
        "Only effective when colocated=True."
    },
)
```
Collaborator


Whether the engines are colocated should be determined by scheduling_strategy:

```yaml
actor:
  ...

rollout:
  scheduling_strategy:
    type: collocation
    target: actor
```

This implies colocation.

In addition, the weight synchronization path is determined automatically from cluster.fileroot, so there is no need to add extra fields to the config.

Comment on lines +260 to +268

```python
@classmethod
def from_colocated_disk(
    cls,
    weight_path: str = "/dev/shm/areal_colocated_weights",
    use_lora: bool = False,
    lora_name: str = "",
    lora_int_id: int = 1,
    base_model_name: str = "",
) -> "WeightUpdateMeta":
```
Collaborator


IMO there's no difference from ordinary disk-based weight update?

Comment on lines +364 to +367

```diff
-if config.get("enable_offload", False):
+# Detect colocated mode: training and inference share the same GPUs.
+is_colocated = config.get("actor", {}).get("colocated", False)
+
+if is_colocated or config.get("enable_offload", False):
```
Collaborator


Launcher is not the recommended usage and will be deprecated in the future. The current launch process is training script (trainer, local python file) -> scheduler submitting remote processes -> remote worker process (areal/infra/rpc/rpc_server.py).

To enable colocation, we should modify the arguments of scheduler.create_workers instead. It has also been implemented. Use proper scheduling_strategy for the job can enable colocation.

Comment on lines +482 to +495
tms_ctx: Any | nullcontext[None] = (
torch_memory_saver.disable() if self._colocated else nullcontext()
)
with tms_ctx:
rollout_batch = self.actor.prepare_batch(
self.train_dataloader,
workflow=workflow,
workflow_kwargs=workflow_kwargs,
should_accept_fn=dynamic_filter_fn,
group_size=config.gconfig.n_samples,
dynamic_bs=self.config.dynamic_bs,
)
if self._colocated:
self.rollout.pause()
Collaborator


self.actor and self.rollout (which would be FSDPEngine and RemoteSGLangEngine, for example) already implement the offload method. You can call self.actor.offload() to offload its parameters. What we should do is add a proper context as an additional engine method (and merge it with the timing context above to avoid extra indentation). This would be more elegant.
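The merged context the reviewer suggests could look like the following sketch, built on contextlib.ExitStack so timing and memory management collapse into one `with` block. All names here are illustrative, and nullcontext() stands in for torch_memory_saver.disable() so the sketch stays self-contained:

```python
import time
from contextlib import ExitStack, contextmanager, nullcontext

@contextmanager
def _timed(key: str, sink: dict):
    # Timing context: records elapsed seconds into `sink` on exit.
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[key] = time.perf_counter() - start

@contextmanager
def rollout_memory_context(colocated: bool, stats: dict):
    """Merge the timing context with the optional memory-release context
    so the trainer needs a single `with` block (names are illustrative)."""
    with ExitStack() as stack:
        stack.enter_context(_timed("rollout", stats))
        if colocated:
            # Real engine would enter torch_memory_saver.disable() here;
            # nullcontext() keeps this sketch runnable without torch.
            stack.enter_context(nullcontext())
        yield

stats = {}
with rollout_memory_context(colocated=True, stats=stats):
    pass  # prepare_batch(...) would run here
print(sorted(stats))  # -> ['rollout']
```

Exposing this as an engine method would let the trainer stay agnostic to whether the engine is colocated, which matches the reviewer's point about avoiding extra indentation in the train loop.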

@garrett4wade
Collaborator

I just provided some style suggestions. It would also be great if you could self-review the PR with the /review-pr command, which may surface details missed by human reviewers.

@HT-Yuan
Contributor Author

HT-Yuan commented Mar 17, 2026

@garrett4wade Thank you very much for your advice. I will use it to improve my implementation and hope to make AReaL even better. Wishing you a pleasant workday.

@HT-Yuan HT-Yuan changed the title feat: support colocated on-policy training [WIP]feat: support colocated on-policy training Mar 17, 2026
@HT-Yuan HT-Yuan marked this pull request as draft March 17, 2026 07:22
@HT-Yuan HT-Yuan force-pushed the feat/support_colocated_on_policy branch from 12a47d9 to f5536f0 Compare March 17, 2026 18:38
@HT-Yuan
Contributor Author

HT-Yuan commented Mar 19, 2026

@garrett4wade I have seen the implementation and discussion in #999, and I believe update_weight_for_tensor is the better choice.
This PR explored a GPU time-sharing approach based on disk-backed weight offload/onload orchestration for colocated on-policy training. Since it takes a different direction from the preferred implementation, I will close this PR here.
Thanks for the feedback.
Best regards.

@HT-Yuan HT-Yuan closed this Mar 19, 2026


Development

Successfully merging this pull request may close these issues.

[Question] Is GPU colocation between rollout and actor supported now? The code contains a colocation path, but it doesn't seem to cover this scenario
