[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context#5881
[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context#5881hujc7 wants to merge 1 commit into
Conversation
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::
RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
trying to launch on device='cuda:0',
but input array for argument 'env_ids' is on device=cuda:2.
Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes build_simulation_context silently dropping the device kwarg when sim_cfg is also provided, which caused warp kernel-launch device mismatches on non-default GPUs. The fix changes device's default from "cuda:0" to None (sentinel) and applies it as an override on sim_cfg.device when explicitly passed. The approach is correct and well-tested.
Design Assessment
Design is sound. Using None as a sentinel to distinguish "caller explicitly chose a device" from "caller did not specify" is the standard Python pattern for this problem. Applying the override after sim_cfg resolution (whether freshly built or passed in) correctly handles both branches with a single if device is not None check. This is minimal, targeted, and preserves backward compatibility for all callers.
Findings
🟡 Warning: In-place mutation of caller-provided sim_cfg — source/isaaclab/isaaclab/sim/simulation_context.py:1001
When a caller passes their own sim_cfg object and a device kwarg, the code mutates the caller's config in-place (sim_cfg.device = device). If a caller reuses the same SimulationCfg instance across multiple calls (e.g., in a test loop with different devices), the object will be permanently modified after the first call. This is visible in the test itself — the second build_simulation_context(sim_cfg=cfg, device="cpu") block mutates cfg.device to "cpu".
In practice, this is unlikely to cause issues because:
- The context manager yields then cleans up, so the mutated config is typically not reused afterward.
- Existing callers in the codebase construct
SimulationCfgfresh or don't reuse after the context.
However, documenting this behavior in the docstring (e.g., "Note: when device is provided alongside sim_cfg, sim_cfg.device is modified in-place") would prevent future surprises. Alternatively, a defensive copy.copy(sim_cfg) before mutation would make the API side-effect-free, but that may be over-engineering for a test helper.
🔵 Suggestion: Consider documenting the mutation side-effect — source/isaaclab/isaaclab/sim/simulation_context.py:971
Adding a brief note in the Args docstring would make the behavior explicit:
device: Device to run the simulation on. When given alongside ``sim_cfg``,
overrides ``sim_cfg.device`` **in-place** so the caller's explicit choice wins.
Test Coverage
- Bug fix: Yes — regression test directly exercises the fixed code path (
build_simulation_context(sim_cfg=cfg, device="cpu")assertssim.cfg.device == "cpu"). - New code: Yes — both the sim_cfg-only path and the device-override path are tested.
- Gaps: None critical. The existing parametrized
test_build_simulation_context_no_cfgtest withdevice=["cuda:0", "cpu"]continues to verify the no-cfg path works correctly with the new sentinel default.
CI Status
Core checks (pre-commit, build wheel, changelog fragments, labeler) are passing. Docker/installation/docs builds are pending — these are infrastructure checks unrelated to this code change.
Verdict
Minor fixes needed
The implementation correctly solves the device-mismatch bug with a clean, minimal approach. The only concern is the undocumented in-place mutation of caller-provided sim_cfg, which is a minor API hygiene issue rather than a correctness problem. The test coverage adequately exercises both code paths and would catch regressions.
Per per-PR minimum-needed analysis: - isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var. - isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper. Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).
[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context
Summary
build_simulation_context(device=cuda:N, sim_cfg=...)silently dropped the explicitdevicekwarg when asim_cfgwas passed, falling back tosim_cfg.device(defaultcuda:0).ISAACLAB_SIM_DEVICE=cuda:Nper shard, so tests that passdevice="cuda:N"gotcuda:0instead. Downstream warp kernels then ran oncuda:0while the rest of the test believed it was oncuda:N, producing:device's default toNone(sentinel) and apply it as an override aftersim_cfgconstruction. The one test that asserted the old "sim_cfg overrides everything" contract is updated to cover the new override semantics.1. The bug
When a caller passed both
sim_cfg=NEWTON_VBD_CFG(with cfg.device defaulted tocuda:0) ANDdevice="cuda:2", the device kwarg was thrown away. Downstream code that pulled the active device fromsim_cfg.devicesawcuda:0; warp arrays allocated against the cfg device ended up oncuda:0, while torch ops driven by the kwarg device ran oncuda:2. Hence the cross-device kernel launch error.2. Fix
device=None(default) means "use whatever cfg already has".device="cuda:N"is honored even when a cfg is also passed.3. Validation
Local A/B on Horde MIG:
build_simulation_context(sim_cfg=cfg, device="cuda:2")previously failed the kernel-launch assertion; with the fix it passes.source/isaaclab/test/sim/test_build_simulation_context_{headless,nonheadless}.py::test_build_simulation_context_cfgupdated to assert the new override semantics.4. Why this is
[MGPU]-taggedThe bug only fires under multi-GPU scenarios where the caller actually wants cuda:1+. Single-GPU CI sets device implicitly to
cuda:0and never hits the mismatch. Companion to the multi-GPU CI in #5875, and independent of #5883 / #5886.