[Question] Multi-Node Distributed RL Validation Status on DGX Spark

Hi everyone,

I run a robotics company developing autonomous UGV/rover-class robots that operate in GPS-denied environments. Each robot carries multiple camera feeds, LiDAR, depth sensors, thermal imagers, and environment detection models running on Jetson Orin Nano Super at the edge.

We’re designing our training infrastructure and would love feedback from anyone who has deployed Isaac Sim and Isaac Lab at scale.

Proposed Setup

- 4× DGX Spark (expanding to 8× later) connected via a MikroTik CRS804-DDQ (4× 400GbE, supports up to 8× Sparks from a single switch)
- 1× RTX 5090 workstation dedicated to Isaac Sim GUI authoring (environment design, sensor rig setup)
- Jetson Orin Nano Super on each robot for edge inference
- Sim-to-real pipeline: author in Isaac Sim → train headless on Spark cluster → validate HIL → deploy to Orin
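
The "up to 8× Sparks from a single switch" figure in the list above is port arithmetic. A minimal sketch of that calculation, assuming each 400GbE switch port can be broken out into two 200GbE links and each Spark uses one 200GbE link (both are assumptions on our side, and `max_nodes` is a hypothetical helper, not part of any tool):

```python
def max_nodes(switch_ports: int, port_gbps: int, node_link_gbps: int) -> int:
    """Nodes one switch can serve if every port is split into equal links.

    Assumption: ports can be broken out evenly into node-speed links.
    """
    if node_link_gbps <= 0 or node_link_gbps > port_gbps:
        raise ValueError("node link speed must be in (0, port_gbps]")
    links_per_port = port_gbps // node_link_gbps
    return switch_ports * links_per_port


# 4 ports x (400 // 200) links per port
print(max_nodes(switch_ports=4, port_gbps=400, node_link_gbps=200))  # 8
```

If breakout is not available on the hardware, the same function with `node_link_gbps=400` gives 4, which would cap the cluster at the initial size.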

Following a review of this reinforcement learning cluster design on the NVIDIA Developer Forums, the moderator recommended opening an issue here to ask about the current validation status and software compatibility of Isaac Lab’s multi-node distributed training on the NVIDIA DGX Spark platform. The forum thread is:

https://forums.developer.nvidia.com/t/dgx-spark-cluster-4-8-nodes-running-isaac-sim-isaac-lab-for-autonomous-robot-training/370549

Questions for the Isaac Lab Team:

1. Multi-Node Execution Status: Has Isaac Lab’s multi-node distributed training module (source/standalone/workflows/rsl_rl/train.py distributed) been actively tested and validated on DGX Spark?
2. Unified Memory Scaling & Performance Overhead: Under heavy multi-node scaling, the system's coherent unified memory serves the local simulation loops (PhysX and RTX ray tracing for sensor generation) and PyTorch distributed orchestration at the same time. Are there any known memory bottlenecks or performance degradation during NCCL multi-node gradient reduction on this architecture?
3. CUDA 13+ / PyTorch Dependency Baseline: Given that the DGX Spark platform maps to newer CUDA 13+ software baselines, do the repository's installation scripts cleanly target and compile extensions (e.g., rsl_rl or Isaac Lab bindings) against cu13 PyTorch wheels on Spark, or should we anticipate manual packaging workarounds?
4. Native arm64 Container Multi-Stage Stability: Are the repository's multi-stage Docker build recipes stable when the images are built and run natively on arm64?
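
To make question 2 more concrete, here is a rough back-of-envelope sketch of the network side of gradient reduction. It assumes a ring all-reduce, in which each node sends 2·(N−1)/N times the gradient payload, and it ignores latency, protocol overhead, and any unified-memory contention, which is exactly the part we cannot estimate ourselves. NCCL may choose a different algorithm, and the parameter count and link speed below are hypothetical placeholders, not measurements:

```python
def ring_allreduce_bytes_per_node(payload_bytes: float, nodes: int) -> float:
    """Bytes each node sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return 2.0 * (nodes - 1) / nodes * payload_bytes


def ring_allreduce_seconds(param_count: int, bytes_per_param: int,
                           nodes: int, link_gbps: float) -> float:
    """Bandwidth-only lower bound on one gradient all-reduce (no latency term)."""
    payload_bytes = param_count * bytes_per_param
    link_bytes_per_second = link_gbps * 1e9 / 8  # Gbit/s -> bytes/s
    return ring_allreduce_bytes_per_node(payload_bytes, nodes) / link_bytes_per_second


# Hypothetical: 1M fp32 parameters, 4 nodes, one 200 Gbit/s link per node.
t = ring_allreduce_seconds(param_count=1_000_000, bytes_per_param=4,
                           nodes=4, link_gbps=200)
print(f"{t * 1e6:.0f} microseconds per all-reduce (bandwidth-only bound)")
```

For small policy networks this bound is tiny, so if multi-node scaling is poor on Spark we would expect the cause to be memory contention or launch overhead rather than link bandwidth. That is the behavior we are hoping someone has already measured.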

We are looking to catch potential toolchain or multi-node execution hurdles early in our physical AI deployment pipeline. Any insights or scaling guidance would be highly valuable!
Thanks in advance.

[Question] Multi-Node Distributed RL Validation Status on DGX Spark #5707
