- Dispatch and Combine with PyTorch Symmetric Memory.
- An optimized implementation featuring memory-pool reuse and zero-copy paths.
- Benchmarked against host-initiated EP (NCCL), with a side-by-side Nsys profile comparison.
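
To make the terms concrete, here is a minimal single-process sketch of what dispatch and combine do in expert parallelism (EP). It is not the implementation described above: it uses neither `torch.distributed` nor symmetric memory, and it assumes top-1 routing with experts laid out contiguously across ranks. All names (`dispatch`, `combine`, `toy_expert`, `run_moe_layer`) are hypothetical and exist only for illustration.

```python
# Toy simulation of EP dispatch/combine in plain Python (no GPUs, no communication).
# Assumptions: top-1 routing; expert e lives on rank e // experts_per_rank.

def dispatch(tokens_per_rank, expert_ids_per_rank, experts_per_rank):
    """Route every token to the rank that owns its expert (simulated all-to-all)."""
    world_size = len(tokens_per_rank)
    recv = [[] for _ in range(world_size)]
    for src in range(world_size):
        pairs = zip(tokens_per_rank[src], expert_ids_per_rank[src])
        for idx, (token, expert_id) in enumerate(pairs):
            dst = expert_id // experts_per_rank
            # Keep (source rank, source index) so combine can undo the routing.
            recv[dst].append((src, idx, expert_id, token))
    return recv


def toy_expert(expert_id, token):
    """Stand-in for an expert MLP: scale the token by (expert_id + 1)."""
    return token * (expert_id + 1)


def combine(expert_outputs, tokens_per_rank):
    """Send expert outputs back to their source rank and original position."""
    out = [[None] * len(tokens) for tokens in tokens_per_rank]
    for items in expert_outputs:
        for src, idx, value in items:
            out[src][idx] = value
    return out


def run_moe_layer(tokens_per_rank, expert_ids_per_rank, experts_per_rank):
    recv = dispatch(tokens_per_rank, expert_ids_per_rank, experts_per_rank)
    expert_outputs = [
        [(src, idx, toy_expert(eid, tok)) for src, idx, eid, tok in items]
        for items in recv
    ]
    return combine(expert_outputs, tokens_per_rank)


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 3.0], [4.0, 5.0]]   # 2 ranks
    expert_ids = [[3, 0, 2], [1, 3]]         # 4 experts, 2 per rank
    print(run_moe_layer(tokens, expert_ids, experts_per_rank=2))
    # -> [[4.0, 2.0, 9.0], [8.0, 20.0]]
```

In a real EP layer, the two nested loops in `dispatch` and `combine` correspond to all-to-all exchanges between GPUs, which is the communication that the NCCL and symmetric-memory variants implement differently.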
WIP: code cleanup and writeup are in progress.
Early results on 8x H100 (SXM5) with a 1-layer MoE Transformer.
In each profile, observe the forward (fwd) and backward (bwd) ranges, then spot the dispatch and combine ranges inside them.
NCCL-EP: `dispatch.forward` is wide (long-running) enough to be clearly visible.
SymmMem-EP: `dispatch.forward` is harder to spot because its range is compressed (shorter).
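
One way to make named ranges such as `dispatch.forward` show up in an Nsys timeline is to wrap the corresponding code in NVTX ranges. The sketch below is an assumption about how this can be done, not necessarily how the profiles above were produced. `profile_range`, `range_log`, and `toy_moe_step` are hypothetical helpers, and `combine.forward` is an illustrative range name. The helper calls `torch.cuda.nvtx.range_push` / `range_pop` only when PyTorch with CUDA is available, and otherwise just records the range names.

```python
from contextlib import contextmanager

try:
    import torch
    _NVTX_OK = torch.cuda.is_available()  # NVTX calls need a CUDA build
except ImportError:
    torch = None
    _NVTX_OK = False

range_log = []  # records push/pop events so the nesting can be inspected


@contextmanager
def profile_range(name):
    """Open a named range; emits NVTX markers when CUDA is available."""
    range_log.append("push:" + name)
    if _NVTX_OK:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _NVTX_OK:
            torch.cuda.nvtx.range_pop()
        range_log.append("pop:" + name)


def toy_moe_step():
    """Placeholder forward pass showing where the ranges would go."""
    range_log.clear()
    with profile_range("fwd"):
        with profile_range("dispatch.forward"):
            pass  # token dispatch would run here
        with profile_range("combine.forward"):
            pass  # token combine would run here
    return list(range_log)


if __name__ == "__main__":
    print(toy_moe_step())
```

Nesting the dispatch and combine ranges inside `fwd` keeps them grouped under the forward pass in the timeline, which makes short ranges easier to locate when zooming in.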
References:


