Support async mode for shm allreduce #484

gaopengff · 2026-01-20T02:49:16Z

This fixes the PyTorch CI failure in the submodule bump PR pytorch/pytorch#172297.
In async mode, shmData must be held exclusively by one operation at a time. We added a lock around shmData to make it thread-safe, and we use a unique tag to synchronize among the different ranks.
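As a rough illustration of the exclusivity requirement above, the sketch below runs several operations concurrently against one shared scratch buffer. Every name in it (`FakeContext`, `ScratchBuffer`, `exclusiveSum`, `runConcurrently`) is hypothetical and is not gloo's API. The per-operation tag here is a simple counter incremented under the lock; how the PR actually derives and uses its tag across ranks is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a context-owned scratch buffer.
// This is NOT gloo's AllreduceSharedMemoryData.
struct ScratchBuffer {
  std::vector<int> slots;
};

struct FakeContext {
  ScratchBuffer scratch{std::vector<int>(4, 0)};
  std::mutex scratchMutex;    // plays the role of shmDataMutex
  std::uint32_t nextTag = 0;  // guarded by scratchMutex
};

struct OpResult {
  std::uint32_t tag;
  int sum;
};

// Stages `input` in the shared scratch buffer and sums it. The lock is held
// for the whole operation, so concurrent callers never see each other's data.
OpResult exclusiveSum(FakeContext& ctx, const std::vector<int>& input) {
  std::lock_guard<std::mutex> guard(ctx.scratchMutex);
  const std::uint32_t tag = ctx.nextTag++;  // unique per operation (assumption)
  ctx.scratch.slots.assign(input.begin(), input.end());
  int sum = 0;
  for (int v : ctx.scratch.slots) {
    sum += v;
  }
  return {tag, sum};
}

// Launches `numOps` operations on separate threads, as async mode might.
std::vector<OpResult> runConcurrently(FakeContext& ctx, int numOps) {
  std::vector<OpResult> results(numOps);
  std::vector<std::thread> workers;
  for (int i = 0; i < numOps; ++i) {
    workers.emplace_back([&ctx, &results, i] {
      results[i] = exclusiveSum(ctx, std::vector<int>(4, i));
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return results;
}
```

Without the `lock_guard`, two threads could overwrite `scratch.slots` while a third is summing it, which is the kind of interference async mode can expose.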

gaopengff · 2026-01-20T03:00:39Z

@d4l3k Could you help review this?

d4l3k

LGTM

d4l3k · 2026-02-06T01:10:48Z

gloo/context.h


  std::shared_ptr<AllreduceSharedMemoryData> shmData;
+  std::mutex shmDataMutex;


What happens if there are multiple gloo process groups? Does that cause any issues?

Also, can we put this under shmData?

gaopengff · 2026-02-09

For the multiple-process-group scenario, I used the gloo context's address to generate a unique ID for the shm name. That way, each group uses a different shm buffer for its allreduce op. I verified this with a PyTorch test that uses multiple process groups, and it passed.
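A minimal sketch of the naming idea described above, assuming a hypothetical `makeShmName` helper and a placeholder `FakeContext` type (neither is gloo code). It only shows that two live context objects in one process yield different names. An address is unique only within a single process and only while the object is alive; how the PR makes all ranks of a group agree on the same name is not covered in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Placeholder for a per-process-group context object (hypothetical).
struct FakeContext {
  int rank = 0;
};

// Builds a shared-memory object name from a prefix and the context's address.
// The leading '/' follows the usual POSIX shm_open naming convention.
std::string makeShmName(const FakeContext* ctx, const std::string& prefix) {
  std::ostringstream name;
  name << "/" << prefix << "_" << std::hex
       << reinterpret_cast<std::uintptr_t>(ctx);
  return name.str();
}
```

Calling `makeShmName(&groupA, "demo_shm")` and `makeShmName(&groupB, "demo_shm")` for two distinct contexts gives two distinct names, so each group would map its own buffer.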

I don't think we can put this under shmData. On the first run, shmData is not yet initialized (nullptr). If multiple threads reach this point at the same time, we need to ensure that only one of them performs the initialization.
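The reasoning above can be sketched as follows: the mutex has to exist before the object it protects, so it sits next to the pointer rather than inside the pointee. `FakeContext`, `SharedData`, `getOrInit`, and `countInitsUnderContention` are hypothetical names used for illustration only, and the `initCalls` counter exists solely so the behavior can be checked.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical payload; stands in for the real shared-memory state.
struct SharedData {
  std::vector<char> buffer;
};

struct FakeContext {
  std::shared_ptr<SharedData> shmData;  // nullptr until first use
  std::mutex shmDataMutex;              // outside shmData, so it is usable
                                        // while shmData is still nullptr
  int initCalls = 0;                    // guarded by shmDataMutex
};

// Returns the shared data, creating it on first call. The null check and the
// assignment happen under the same lock, so only one thread initializes.
std::shared_ptr<SharedData> getOrInit(FakeContext& ctx) {
  std::lock_guard<std::mutex> guard(ctx.shmDataMutex);
  if (!ctx.shmData) {
    ++ctx.initCalls;
    ctx.shmData = std::make_shared<SharedData>();
  }
  return ctx.shmData;
}

// Has `numThreads` threads race to initialize one context and reports how
// many times initialization actually ran.
int countInitsUnderContention(int numThreads) {
  FakeContext ctx;
  std::vector<std::thread> workers;
  for (int i = 0; i < numThreads; ++i) {
    workers.emplace_back([&ctx] { getOrInit(ctx); });
  }
  for (auto& w : workers) {
    w.join();
  }
  return ctx.initCalls;
}
```

If the mutex were a member of `SharedData`, the first caller would have nothing to lock before the object existed, which is the problem the reply describes.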

meta-cla bot added the CLA Signed label Jan 20, 2026

gaopengff mentioned this pull request Jan 20, 2026

third-party/gloo: bumped submodule version to support shared-memory allreduce pytorch/pytorch#172297

Open

support async mode of torch for shm allreduce

1f5e2db

d4l3k approved these changes Feb 6, 2026


d4l3k reviewed Feb 6, 2026


gaopengff added 2 commits February 6, 2026 15:08

Merge branch 'main' into gaopengf/support_torch_async

fda784d

generate unique shm buffer for each group

874795c

d4l3k mentioned this pull request Feb 7, 2026

Revert "Intra-node shared memory (SHM) optimizations for CPU primitives (#458)" #490

Merged

gaopengff requested a review from d4l3k February 9, 2026 03:40
