Accelerate and torchrun launcher support #10

Open
bghira wants to merge 1 commit into lodestone-rock:main from bghira:feature/hf-accelerate-multigpu
bghira commented Nov 26, 2025

what & why

  • make ramtorch work under torchrun/accelerate where workers don’t inherit the parent’s shared CPU tensors (no more relying on fork/vfork side-effects for sharing).
  • keep ramtorch params in sync even when shared storage can’t be used.
  • add a non-cuda fallback path so linear doesn’t explode on mps/cpu (still offload-less, but functions adequately for testing spawn behaviour)

change details

  • new attach_shared_ramtorch_parameters(model, process_group=None): rank0 shares CPU storages, broadcasts handles, other ranks rebind their ramtorch params to the shared storage; barrier to settle. preserves single-host-copy behavior without a parent-process fork.
  • broadcast_zero_params(..., include_ramtorch=False): optional sync of ramtorch params when sharing isn’t available (correctness over memory dedup).
  • linear: pick device in order cuda→mps→cpu; skip pin_memory when cuda isn’t there; add synchronous fwd/bwd path for non-cuda devices.
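
a minimal sketch of the device ordering described in the last bullet, not the actual code in this PR. `pick_device` and `should_pin_memory` are hypothetical names; the availability flags are passed in as plain booleans so the snippet runs without torch installed.

```python
def pick_device(has_cuda: bool, has_mps: bool) -> str:
    """Return the preferred device name in the order cuda -> mps -> cpu."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def should_pin_memory(device: str) -> bool:
    """Pinned host memory only helps CUDA transfers, so skip it elsewhere."""
    return device == "cuda"


# With PyTorch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
device = pick_device(has_cuda=False, has_mps=True)
print(device, should_pin_memory(device))
```

on a machine with only mps, this picks `mps` and skips pinning, which is the case the synchronous fwd/bwd path is meant to cover.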

the mps-compatible path is added so that development on ramtorch can be done even on unified memory architectures.

i've got a monkeypatch version of this for my trainer that patches ramtorch at runtime, so if you're not comfortable including these changes, it's not a big deal - but it would be very nice to support broader adoption of ramtorch for multigpu/multinode training.

@lodestone-rock

Owner

i'm still working on the torchrun stuff, because if you call torchrun naively the state won't get shared: you end up with duplicate copies, and none of them update each other.

i don't think it can be run natively using torchrun tbh, because you have to run it as a spawned child to share a common CPU buffer

ideally we want a more elegant solution than just a raw bypass 🤔

bghira commented Nov 28, 2025

Author

yes, it's using shared memory handles to pass the data to the other ranks, and it only falls back to the inelegant approach if SHM isn't available
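
to illustrate the idea (not the PR's implementation, which works on torch CPU storages and a process group), here is a stdlib-only sketch using `multiprocessing.shared_memory`. the owner publishes a named block, a peer attaches by that name without copying, and a plain copy is used if the block can't be created. `publish` and `attach` are hypothetical helpers, and both sides run in one process here just to keep the example small.

```python
from multiprocessing import shared_memory


def publish(payload: bytes):
    """Owner side: place payload in a named shared block and return it.

    Returns None when shared memory cannot be created, so the caller can
    fall back to handing out full copies instead.
    """
    try:
        block = shared_memory.SharedMemory(create=True, size=len(payload))
    except OSError:
        return None
    block.buf[: len(payload)] = payload
    return block


def attach(name: str):
    """Peer side: map the existing block by name; no data is copied."""
    return shared_memory.SharedMemory(name=name)


weights = bytes([1, 2, 3, 4])
owner = publish(weights)
if owner is None:
    # Fallback: the peer keeps its own copy (correct, but duplicated).
    peer_view = bytearray(weights)
    shared = False
else:
    peer = attach(owner.name)  # only the name (the "handle") is passed
    owner.buf[0] = 42  # an update made by the owner...
    peer_view = bytearray(peer.buf[: len(weights)])  # ...is seen by the peer
    shared = True
    peer.close()
    owner.close()
    owner.unlink()  # the owner frees the block once everyone is done

print(shared, list(peer_view))
```

the point is that only the handle travels between ranks, so there is still a single host copy; the duplicated path is only taken when the shared block can't be set up.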
