Skip to content

[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context#5881

Draft
hujc7 wants to merge 1 commit into
isaac-sim:developfrom
hujc7:jichuanh/fix-build-sim-context-device
Draft

[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context#5881
hujc7 wants to merge 1 commit into
isaac-sim:developfrom
hujc7:jichuanh/fix-build-sim-context-device

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 30, 2026

[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context

Summary

  • build_simulation_context(device=cuda:N, sim_cfg=...) silently dropped the explicit device kwarg when a sim_cfg was passed, falling back to sim_cfg.device (default cuda:0).
  • Multi-GPU CI workflows set ISAACLAB_SIM_DEVICE=cuda:N per shard, so tests that pass device="cuda:N" got cuda:0 instead. Downstream warp kernels then ran on cuda:0 while the rest of the test believed it was on cuda:N, producing:
    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.
    
  • Fix: change device's default to None (sentinel) and apply it as an override after sim_cfg construction. The one test that asserted the old "sim_cfg overrides everything" contract is updated to cover the new override semantics.

1. The bug

def build_simulation_context(
    create_new_stage: bool = True,
    gravity_enabled: bool = True,
    device: str = "cuda:0",       # <-- default here suppressed any explicit kwarg
    dt: float = 0.01,
    sim_cfg: SimulationCfg | None = None,
    ...
):
    if sim_cfg is None:
        sim_cfg = SimulationCfg(device=device, ...)
    # ^^ explicit device only used in the no-sim_cfg path; otherwise ignored

When a caller passed both sim_cfg=NEWTON_VBD_CFG (with cfg.device defaulted to cuda:0) AND device="cuda:2", the device kwarg was thrown away. Downstream code that pulled the active device from sim_cfg.device saw cuda:0; warp arrays allocated against the cfg device ended up on cuda:0, while torch ops driven by the kwarg device ran on cuda:2. Hence the cross-device kernel launch error.

2. Fix

def build_simulation_context(
    create_new_stage: bool = True,
    gravity_enabled: bool = True,
    device: str | None = None,    # sentinel
    dt: float = 0.01,
    sim_cfg: SimulationCfg | None = None,
    ...
):
    if sim_cfg is None:
        gravity = (0.0, 0.0, -9.81) if gravity_enabled else (0.0, 0.0, 0.0)
        sim_cfg = SimulationCfg(device=device or "cuda:0", dt=dt, gravity=gravity)
    elif device is not None:
        sim_cfg.device = device   # explicit kwarg wins

device=None (default) means "use whatever cfg already has". device="cuda:N" is honored even when a cfg is also passed.

3. Validation

Local A/B on Horde MIG: build_simulation_context(sim_cfg=cfg, device="cuda:2") previously failed the kernel-launch assertion; with the fix it passes.

source/isaaclab/test/sim/test_build_simulation_context_{headless,nonheadless}.py::test_build_simulation_context_cfg updated to assert the new override semantics.

4. Why this is [MGPU]-tagged

The bug only fires under multi-GPU scenarios where the caller actually wants cuda:1+. Single-GPU CI sets device implicitly to cuda:0 and never hits the mismatch. Companion to the multi-GPU CI in #5875, and independent of #5883 / #5886.

Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
@github-actions github-actions Bot added bug Something isn't working isaac-lab Related to Isaac Lab team labels May 30, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR fixes build_simulation_context silently dropping the device kwarg when sim_cfg is also provided, which caused warp kernel-launch device mismatches on non-default GPUs. The fix changes device's default from "cuda:0" to None (sentinel) and applies it as an override on sim_cfg.device when explicitly passed. The approach is correct and well-tested.

Design Assessment

Design is sound. Using None as a sentinel to distinguish "caller explicitly chose a device" from "caller did not specify" is the standard Python pattern for this problem. Applying the override after sim_cfg resolution (whether freshly built or passed in) correctly handles both branches with a single if device is not None check. This is minimal, targeted, and preserves backward compatibility for all callers.

Findings

🟡 Warning: In-place mutation of caller-provided sim_cfgsource/isaaclab/isaaclab/sim/simulation_context.py:1001

When a caller passes their own sim_cfg object and a device kwarg, the code mutates the caller's config in-place (sim_cfg.device = device). If a caller reuses the same SimulationCfg instance across multiple calls (e.g., in a test loop with different devices), the object will be permanently modified after the first call. This is visible in the test itself — the second build_simulation_context(sim_cfg=cfg, device="cpu") block mutates cfg.device to "cpu".

In practice, this is unlikely to cause issues because:

  1. The context manager yields then cleans up, so the mutated config is typically not reused afterward.
  2. Existing callers in the codebase construct SimulationCfg fresh or don't reuse after the context.

However, documenting this behavior in the docstring (e.g., "Note: when device is provided alongside sim_cfg, sim_cfg.device is modified in-place") would prevent future surprises. Alternatively, a defensive copy.copy(sim_cfg) before mutation would make the API side-effect-free, but that may be over-engineering for a test helper.

🔵 Suggestion: Consider documenting the mutation side-effectsource/isaaclab/isaaclab/sim/simulation_context.py:971

Adding a brief note in the Args docstring would make the behavior explicit:

device: Device to run the simulation on. When given alongside ``sim_cfg``,
    overrides ``sim_cfg.device`` **in-place** so the caller's explicit choice wins.

Test Coverage

  • Bug fix: Yes — regression test directly exercises the fixed code path (build_simulation_context(sim_cfg=cfg, device="cpu") asserts sim.cfg.device == "cpu").
  • New code: Yes — both the sim_cfg-only path and the device-override path are tested.
  • Gaps: None critical. The existing parametrized test_build_simulation_context_no_cfg test with device=["cuda:0", "cpu"] continues to verify the no-cfg path works correctly with the new sentinel default.

CI Status

Core checks (pre-commit, build wheel, changelog fragments, labeler) are passing. Docker/installation/docs builds are pending — these are infrastructure checks unrelated to this code change.

Verdict

Minor fixes needed

The implementation correctly solves the device-mismatch bug with a clean, minimal approach. The only concern is the undocumented in-place mutation of caller-provided sim_cfg, which is a minor API hygiene issue rather than a correctness problem. The test coverage adequately exercises both code paths and would catch regressions.

@hujc7 hujc7 changed the title [Sim] Honor device kwarg over sim_cfg in build_simulation_context [MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context Jun 2, 2026
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request Jun 3, 2026
Per per-PR minimum-needed analysis:
- isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have;
  isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot).
  Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in
  AppLauncher; drops the workflow env var.
- isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of
  this diagnostic bundle to test whether isaac-sim#5933 alone is enough for
  newton test_articulation (which calls
  build_simulation_context(sim_cfg=, device=) at line 2427, so still
  needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton
  test_articulation kitless conversion and the schemas.py
  _create_fixed_joint_to_world helper.

Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit
XML path-collision fix in conftest. If green, confirms only 4 PRs
are needed for multi-GPU CI green (with test_articulation un-gated).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant