@sunxxuns sunxxuns commented Jan 22, 2026

No description provided.

@sunxxuns sunxxuns changed the title from "amd-demo: MI350 optimized kernels benchmark" to "[test] amd-demo: MI350 optimized kernels benchmark" Jan 22, 2026
- Aiter Flash Attention for AMD GPUs
- Triton kernels for RMSNorm, GELU+Mul
- Full policy inference: 142 ms latency, ~7 Hz (Pi0 3.5B, batch=1)
- 8-GPU DDP training: 407 samples/s (3.3B model)
- Training convergence verified
- Perfetto traces included
@sunxxuns sunxxuns closed this Jan 23, 2026
sunxxuns pushed a commit to sunxxuns/openpi-1 that referenced this pull request Feb 2, 2026
Benchmark results comparing NVIDIA H200 vs AMD MI350 (PR Physical-Intelligence#858):

Inference (Pi0 3.5B, batch=1):
- H200: 118.5-120.8 ms (8.28-8.44 Hz)
- MI350: 142.0 ms (7.04 Hz)
- H200 latency is ~15-17% lower than MI350 (equivalently, ~18-20% more inferences per second)
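
The inference figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not part of the PR: the helper names `latency_ms_to_hz` and `latency_reduction_pct` are hypothetical, and it assumes the 15-17% figure is a latency reduction measured relative to the MI350 latency, which is the reading that reproduces the stated range.

```python
def latency_ms_to_hz(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to a rate in Hz."""
    return 1000.0 / latency_ms


def latency_reduction_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Percent by which candidate latency is lower than baseline latency."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# Latencies taken from the benchmark summary above.
H200_MS = (118.5, 120.8)
MI350_MS = 142.0

if __name__ == "__main__":
    print(f"MI350: {latency_ms_to_hz(MI350_MS):.2f} Hz")
    for ms in H200_MS:
        hz = latency_ms_to_hz(ms)
        lower = latency_reduction_pct(MI350_MS, ms)
        # Ratio of rates, i.e. how many more inferences per second the H200 run delivers.
        more = (MI350_MS / ms - 1.0) * 100.0
        print(f"H200 {ms} ms: {hz:.2f} Hz, {lower:.1f}% lower latency, {more:.1f}% higher rate")
```

Note that "percent lower latency" and "percent higher rate" are different numbers for the same pair of measurements, so it helps to state which one a comparison uses.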

8-GPU DDP Training (3.3B model):
- H200 (SDPA): 320 samples/s best
- MI350 (Aiter): 407 samples/s best
- MI350 throughput is ~21-27% higher
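
The training comparison can be checked the same way. The sketch below is illustrative only (the helper names `speedup_pct` and `per_gpu` are hypothetical); it uses just the two "best" throughput numbers listed above. Those two numbers give the upper end of the stated range; the lower end presumably comes from other runs that are not listed in this summary.

```python
def speedup_pct(candidate: float, baseline: float) -> float:
    """Percent by which candidate throughput exceeds baseline throughput."""
    return (candidate / baseline - 1.0) * 100.0


def per_gpu(samples_per_s: float, num_gpus: int) -> float:
    """Average throughput per GPU, assuming an even split across ranks."""
    return samples_per_s / num_gpus


# Best 8-GPU DDP throughputs from the benchmark summary above.
H200_SDPA = 320.0
MI350_AITER = 407.0
NUM_GPUS = 8

if __name__ == "__main__":
    print(f"MI350 vs H200 (best runs): {speedup_pct(MI350_AITER, H200_SDPA):.1f}% higher")
    print(f"H200 per GPU:  {per_gpu(H200_SDPA, NUM_GPUS):.1f} samples/s")
    print(f"MI350 per GPU: {per_gpu(MI350_AITER, NUM_GPUS):.1f} samples/s")
```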

Files added:
- BENCHMARK_H200.md: Full comparison results
- scripts/benchmark_h200_ddp.py: DDP training benchmark (eager)
- scripts/benchmark_h200_ddp_sdpa.py: DDP training benchmark (SDPA)

Trace files generated but not committed (too large):
- h200_inference.json (280 MB)
- h200_ddp_*_rank[0-7].json (~41-46 MB each)
sunxxuns pushed a commit to sunxxuns/openpi-1 that referenced this pull request Feb 9, 2026