[CI] Cross-platform — Part 4: Multi-GPU pytest workflow with cuda:1 coverage#5823
[CI] Cross-platform — Part 4: Multi-GPU pytest workflow with cuda:1 coverage#5823hujc7 wants to merge 19 commits into
Conversation
There was a problem hiding this comment.
Overall
This is a well-designed PR that introduces a clean, env-driven device parametrization helper for multi-GPU testing. The mask grammar is elegant and the documentation in the module docstring is excellent. The workflow setup and test migrations look correct.
However, there's a critical issue that will prevent the tests from running.
Critical: Missing __init__.py
The new isaaclab.testing package is missing its __init__.py file. Without it, Python won't recognize source/isaaclab/isaaclab/testing/ as a package, and all the imports like:
from isaaclab.testing import cuda_test_deviceswill fail with ModuleNotFoundError.
Fix: Add source/isaaclab/isaaclab/testing/__init__.py with at minimum:
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
"""Testing utilities for Isaac Lab."""
from .devices import cuda_test_devices
__all__ = ["cuda_test_devices"]Minor Suggestions
1. Workflow: Consider using matrix strategy for future extensibility
The pytest target list is explicit (good for control), but as you add more files in P1/P2, consider whether a matrix or glob pattern might be cleaner. Current approach is fine for now.
2. devices.py line 114: Consider caching available devices
Calling torch.cuda.device_count() on every test parametrization is cheap but redundant. A module-level cache (computed once at import) would be cleaner:
@functools.lru_cache(maxsize=1)
def _list_available_devices() -> tuple[str, ...]:
...(Return tuple for hashability, or just compute at module level.)
3. Workflow line 57: Pin Isaac Sim version handling
The fallback '6.0.0' is hardcoded. If vars.ISAACSIM_BASE_VERSION is the source of truth, consider failing explicitly when unset rather than silently falling back, to avoid version drift.
4. Consider adding a simple unit test for cuda_test_devices itself
A few tests for the helper (mask parsing, strict vs non-strict, edge cases like empty mask, invalid chars) would catch regressions in the helper logic itself.
Summary
- Blocking: Add the missing
__init__.pyforisaaclab.testing - Non-blocking: Minor polish suggestions above
Once the __init__.py is added, this looks good to merge. Nice work on the mask grammar design! 🎯
c0b18ca to
30667b8
Compare
There was a problem hiding this comment.
Overall Assessment
This is a well-designed PR that introduces an elegant, env-driven device parametrization system for multi-GPU testing. The mask grammar is intuitive (e.g., 110 → [cpu, cuda:0], 001 → [cuda:1]), the documentation is excellent, and the backward-compatible default ensures zero impact on existing single-GPU CI runs.
The PR successfully addresses:
- ✅ New
isaaclab.testingmodule withcuda_test_devices()helper - ✅ Auto-discovery workflow that picks up opt-in test files automatically
- ✅ Migration of 19 test files (~280 parametrize sites)
- ✅ Clear P1/P2 scope documentation for follow-ups
Findings
1. Missing Unit Tests for the Helper Function (Medium)
Location: source/isaaclab/isaaclab/testing/devices.py
The cuda_test_devices() function has non-trivial mask parsing logic with multiple edge cases (trailing X, strict vs non-strict, empty masks, invalid characters), but there are no dedicated unit tests for the helper itself.
Recommendation: Add a test file like source/isaaclab/test/testing/test_cuda_test_devices.py with cases covering:
- Default mask resolution
- Explicit mask parsing (
"110","001","00X","X") - Strict mode raising on unavailable devices
- Non-strict mode returning empty list gracefully
- Invalid mask characters raising
ValueError
This protects against regressions in the helper logic itself.
2. Workflow SKIP List Pattern Could Be More Robust (Low) ✅ RESOLVED
Resolved in f66b101 — The new include-files approach is cleaner and avoids the heredoc pattern entirely.
3. Consider Caching Device List (Nit)
Location: source/isaaclab/isaaclab/testing/devices.py line 113
_list_available_devices() is called on every cuda_test_devices() invocation. While torch.cuda.device_count() is cheap, caching the result with @functools.lru_cache would be cleaner and more efficient for test collection.
Summary
This is a solid foundation for multi-GPU testing infrastructure. The design choices are well-reasoned and the migration is clean.
Actionable items:
- (Recommended) Add unit tests for
cuda_test_devices()to catch edge-case regressions (Optional) Minor robustness improvements noted above✅ Resolved
Nice work on the mask grammar design! 🎯
Update (d6b2934): New commit adds explicit pip install pytest step to the workflow — a good defensive fix ensuring runner independence. No new issues. Previous suggestions remain optional improvements for future iterations.
Update (f1fa016): Good improvements in this commit:
- Extended pip install to include
pytest-mockandflakywith clear explanation - Added
is_ok()helper treating pytest exit code 5 (no tests collected) as success — correct for device-gated tests - Extended SKIP list with 4 newton tests and improved documentation
No new issues. Previous optional suggestions (unit tests for helper, robustness improvements) remain relevant but non-blocking.
Update (042c409): This commit expands the skip list with well-documented categories explaining why each test is skipped (FabricFrameView deadlocks, SIGSEGV failures, cuda:1 hangs, etc.). Good practice to document the reasoning. Also adds test_simulation_context.py and physx rigid-object tests to the skip list. No new issues — this is maintenance housekeeping for multi-GPU stability.
Update (1546520): 🎉 Excellent fix! This commit addresses the root cause of the cuda:1 test failures:
app_launcher.py: AddedISAACLAB_SIM_DEVICEenv var support to override the default device when not explicitly passed. Clean implementation with good documentation.- Workflow: Sets
ISAACLAB_SIM_DEVICE=cuda:1to boot Kit withactive_gpu=1from process start. This correctly targets PhysX/Warp to cuda:1 before SimulationApp locks it. - Result: 7 tests removed from the skip list and now run successfully on cuda:1!
This is a proper fix rather than just skipping problematic tests. No new issues introduced.
Update (feb8e21): Skip list maintenance — expanded documentation with categorized skip reasons:
- Upstream Kit limits (FabricFrameView)
- Pre-existing test/API breakage (newton contact_sensor)
- Newton cuda:1 — Warp allocator failure in mujoco_warp collision driver
- PhysX cuda:1 — hangs specific to AWS L40 runner (passes on other hardware)
Added 7 files back to skip list (4 Newton, 3 PhysX asset tests) with clear root-cause explanations. Good documentation practice for future debugging. No new issues.
Update (6eeda64): Added changelog entry (jichuanh-multi-gpu-ci.minor.rst) documenting the new cuda_test_devices() helper and ISAACLAB_SIM_DEVICE env var. Good housekeeping. No new issues.
Update (e3d328b): Major infrastructure improvement — switched from bare-runner installation to Docker-based execution:
- Added
configjob that loads isaacsim image config fromconfig.yamlvia sparse checkout - Added
ecr-build-push-pullaction to pull the same per-commit CI image used by regular CI (dep matrices now match exactly) - Tests run inside container with workspace volume-mounted, removing need for
pip installsteps on runner - Cleaned up SKIP list comments (removed obsolete contact_sensor reference)
- Proper symlink setup (
rm -f _isaac_sim && ln -s /isaac-sim _isaac_sim) inside container
This is a solid approach — reusing the regular CI's Docker image ensures environment consistency and removes runner-specific pip quirks. Good fallback documentation if ECR cache misses. No new issues.
Update (3c5248d): Good documentation and consistency improvements:
- Added clear comments explaining
--entrypoint bash(overrides isaac-sim base image's default ENTRYPOINT that would swallow the script) - Added explanation for
--ignore=tools/conftest.py(matches regular CI's behavior) - Consolidated ignore paths into
ignore_argsvariable and applied to both pure-python and per-file pytest runs - Cleaner shell structure:
--entrypoint bash+-cinstead of inlinebash -c
No new issues. Ready for merge. 🚀
Update (9ff909d): Fixes local image availability issue — when ecr-build-push-pull takes the deps-cache-hit path, it creates an ECR registry-side alias but leaves no image in local Docker. This commit adds an "Ensure image is local" step that:
- Creates isolated DOCKER_CONFIG with
credsStoredisabled (workaround for broken runner credential helper) - Re-authenticates to ECR via EC2 IAM role
- Explicitly pulls and tags the image for subsequent container steps
Good defensive fix with clear comments explaining the edge case. No new issues.
Update (6706690): Good fixes for container permission issues:
- Added
--user "$(id -u):$(id -g)"to run container as host uid:gid — fixes write permission failures on volume-mounted workspace - Created writable $HOME (
/tmp/isaaclab-ci-home) withXDG_CACHE_HOMEfor Warp/numpy/pip caches - Well-commented: explains the issue (image's default
isaaclabuid 1000 ≠ runner'sgithub-runneruser)
Proper solution for the permission mismatch. No new issues.
Update (328878a): Skip list expanded with ovphysx tests. Good documentation explaining the issue: module-level pytest.mark.skipif collects 0 items + 1 skip, which isaaclab.sh CLI wrapper interprets as exit 1 (doesn't distinguish from real failures). Added 3 ovphysx asset tests to skip list. Clean maintenance commit. No new issues.
Update (665f0c3): Added temporary run_docker_tests: 'false' override in build.yaml to free GPU runners during PR iteration. The TEMP comment is well-documented ("revert before final review").
Update (f66b101): 🎉 Excellent refactor! This commit makes major architectural improvements:
Changes:
- New
extra-env-varsinput added to bothrun-testsandrun-package-testsactions — clean mechanism to forward env vars into the container with proper parsing (handles comments, whitespace,KEY=valueformat) - Workflow refactored to use
run-package-testsaction instead of ~150 lines of inline bash. Much cleaner and more maintainable. - Separate
buildjob added to pre-populate the ECR image before test job runs (mirrors the build→test split inbuild.yaml) - Include-list approach replaces the SKIP list pattern — explicitly listing 9 known-working tests is safer than auto-discovery with exclusions
- Path triggers expanded to include action files, so workflow changes properly trigger CI
My previous Finding #2 (SKIP list robustness) is now obsolete — the new include-files approach is cleaner and avoids the heredoc pattern entirely. ✅
No new issues. This is a significant quality improvement that makes the multi-GPU workflow consistent with the rest of the CI infrastructure.
run_docker_tests: 'false' in build.yaml (from 665f0c3).
Update (3c985eb): Minor update for FabricFrameView retest iteration:
Changes in build.yaml:
- Updated TEMP comment: now focused on "FabricFrameView cuda:1 retest" (previously was "Multi-GPU pytest docker conversion")
- Streamlined comment (removed redundant "freeing the single-GPU runner pool" explanation)
run_docker_tests: 'false'remains set — still iterating
Changes in test-multi-gpu-pytest.yaml:
- Added
test_views_xform_prim_fabric.pyto the include-files list — attempting to enable the FabricFrameView test on cuda:1
This is a WIP commit testing whether the FabricFrameView deadlock issue (documented in earlier skip lists) has been resolved. run_docker_tests: 'false' still needs to be reverted before merge.
No new architectural issues. Just experimental test expansion.
Update (bdcb527): ✅ Final Cleanup Complete
The latest commit removes the temporary run_docker_tests: 'false' override that was used during iteration, restoring the proper change-detection logic:
- run_docker_tests: 'false'
+ run_docker_tests: ${{ steps.detect.outputs.run_docker_tests }}This was the expected final step — the temporary bypass was necessary to test the FabricFrameView cuda:1 retest path, and it's now correctly reverted.
Status: All outstanding items addressed. PR is ready for merge. 🚀
There was a problem hiding this comment.
Update (5d29bb0): ✅ Ready for Merge
The final commit reverts the temporary `run_docker_tests: 'false' override — the exact action my previous review flagged as blocking. All outstanding items are now addressed:
✅ Missing __init__.py — Added in earlier commit
✅ **run_docker_tests: 'false' reverted** — This commit ✅ **Workflow refactored** to use run-package-tests` action (f66b101)
✅ Include-list approach replaces SKIP heredoc pattern
✅ Changelog fragment added
Summary
This PR is now complete and ready for merge. The multi-GPU testing infrastructure is solid:
- Clean
cuda_test_devices()helper with elegant mask grammar ISAACLAB_SIM_DEVICEenv var for AppLauncher device override- Proper integration with existing CI via
run-package-testsaction - 9 test files migrated with known-working cuda:1 paths
No new issues introduced. Nice work! 🚀
Update (d6d69c4): ⚠️ Iteration Checkpoint — Not Merge-Ready
This commit re-introduces the temporary run_docker_tests: 'false' override in .github/workflows/build.yaml`. The PR author is iterating and has added a note:
"TEMP (revert before final review / before landing): force run_docker_tests=false while iterating PR #5823"
Status: The PR was previously ready for merge, but is now in an iteration state. This temporary flag must be reverted before merging — the author has clearly marked it as such.
📋 Blocking for merge:
- Revert `run_docker_tests: 'false' override (same fix as commit 5d29bb0)
No functional regressions otherwise — this is expected iteration behavior.
Update (ea919b7): Module Relocation
This commit relocates the testing utilities from isaaclab.testing to isaaclab.test.utils:
Changes:
- Renamed
source/isaaclab/isaaclab/testing/→source/isaaclab/isaaclab/test/utils/ - Updated imports across 16 test files in
isaaclab,isaaclab_newton,isaaclab_ovphysx, andisaaclab_physxpackages - Workflow path trigger updated from
source/isaaclab/isaaclab/testing/**→source/isaaclab/isaaclab/test/utils/** - Changelog fragment updated to reflect new import path
✅ This is a clean refactor that consolidates test utilities under the isaaclab.test namespace. All import statements are consistently updated.
📋 Blocking for merge:
- Revert `run_docker_tests: 'false' override (status unchanged from previous update)
Otherwise, no new issues introduced.
Update (0118ea7): 🎯 Nice Refinements
This commit adds several quality-of-life improvements:
Changes:
1. run-tests/action.yml — Smarter env var parsing
- Now preserves mid-line
#characters (e.g.,IMAGE_TAG=v1.0#nightlysurvives) - Only treats full-line comments (where first non-whitespace is
#) as comments - Strips YAML block indent whitespace properly
2. test-multi-gpu-pytest.yaml — Auto-discovery of opt-in tests
- Replaces hardcoded include-list with
grep -rl 'cuda_test_devices'discovery - Adding a new test to multi-GPU CI now requires no workflow edit — just use the helper
- Clean design: opt-in via code, not via workflow maintenance
3. cuda_test_devices() helper — New skip parameter
- Added
skip: dict[str, str]for per-device skips with reasons - Wraps skipped devices in
pytest.param(..., marks=pytest.mark.skip(reason=...)) - CI shows
SKIPPEDwith reason instead of silent omission - Changed
strictdefault fromTruetoFalse(friendlier for CPU-only dev machines)
4. test_views_xform_prim_fabric.py — Migrated to helper
- Replaced
@pytest.mark.parametrize("device", ["cpu", "cuda:0"])withcuda_test_devices() - Removed manual
ISAACLAB_TEST_MULTI_GPUskipif decorators - Uses
mask="001", strict=Falsefor cuda:1-only tests
✅ All changes are clean refinements that improve maintainability and developer experience.
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override (status unchanged)
Nice incremental polish! 🔧
Update (101a47a): 🔧 Newton Device-Pinning & Test Tolerance Fixes
This commit addresses cuda:1 context issues in Newton manager and fixes device-specific test tolerances:
Changes:
1. newton_manager.py — Device context pinning (issue #5132)
Added explicit torch.cuda.set_device() + wp.set_device() calls in two critical locations:
start_simulation()— Before Warp/Newton allocations to ensure mujoco_warp collision pipeline allocates on the correct CUDA deviceinitialize_solver()— Mirrors the guard fromstart_simulation()for solver build and collision-pipeline init
This fixes null pointer returns when mujoco_warp allocates against a primary CUDA context that was never made current. The guard is idempotent if already pinned.
2. wp.ScopedCapture(device=device) — Explicit device for CUDA graph capture
Previously, CUDA graph capture didn't specify a device; now explicitly passes device to avoid capturing on wrong GPU.
3. Test tolerance fixes (test_rigid_object.py in both newton & physx packages)
Changed if device == "cuda:0" to if device.startswith("cuda") for tolerance selection. This ensures cuda:1 (and any other CUDA device) uses the relaxed GPU tolerance (1e-2) instead of the strict CPU tolerance (1e-5).
✅ Solid fixes:
- Device-pinning guards reference the related issue (#5132)
- Comments explain why the guards are needed (mujoco_warp context behavior)
- Test tolerance fix is the correct pattern for any CUDA device
- CUDA graph capture now device-aware
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override (status unchanged)
Otherwise, these are important correctness fixes for multi-GPU setups. 👍
Greptile SummaryThis PR introduces a complete multi-GPU pytest CI lane: a
Confidence Score: 5/5Safe to merge; all findings are edge-case concerns that do not affect the normal single-GPU or multi-GPU test paths. The core logic — mask intersection, atomic work-queue claim, shard fan-out, Newton device pin, and build_simulation_context device-kwarg fix — is well-designed and backed by dedicated unit tests. The three flagged concerns (XR + ISAACLAB_SIM_DEVICE stale dict key, single-quoted discovery gap, theoretical slug collision) all require uncommon conditions to trigger and none affect the intended multi-GPU CI path or the single-GPU invariant. source/isaaclab/isaaclab/app/app_launcher.py (XR + env-device key sync) and .github/workflows/test-multi-gpu-pytest.yaml (discovery grep quote handling) are the two spots that could benefit from a quick follow-up, though neither blocks merging. Important Files Changed
Sequence DiagramsequenceDiagram
participant WF as workflow yaml
participant HL as host_launcher.sh
participant DC as Docker container
participant SR as shard_runner.sh
participant CF as conftest.py
participant Q as Work Queue
WF->>WF: "Discover opt-in test files"
WF->>HL: "bash multi_gpu_host_launcher.sh"
HL->>Q: "Seed queue/ with file slugs"
HL->>DC: "docker run --gpus all"
DC->>SR: "multi_gpu_shard_runner.sh"
SR->>SR: "nvidia-smi DEV_COUNT + torch cross-check"
loop "For each cuda:N (N=1..DEV_COUNT-1)"
SR->>CF: "pytest subprocess (ISAACLAB_SIM_DEVICE + ISAACLAB_TEST_DEVICES)"
CF->>Q: "claim: os.rename queue to inflight"
CF->>CF: "Run per-file pytest with mgpu_shard_select plugin"
CF->>Q: "done: os.rename inflight to done"
end
SR-->>DC: "Aggregate exit codes"
DC-->>HL: "docker exit code"
HL->>HL: "Reconciler: assert queue + inflight empty"
HL-->>WF: "exit (docker_rc or 2 on orphan)"
WF->>WF: "aggregate_test_summary.py to Step Summary"
Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/dev..." | Re-trigger Greptile |
|
|
||
| Args: | ||
| mask: Optional explicit mask string following the grammar in the | ||
| module docstring. When ``None``, the helper reads the mask from | ||
| the ``ISAACLAB_TEST_DEVICES`` environment variable, defaulting | ||
| to ``"110"`` if the variable is unset. | ||
| strict: When ``True`` (the default), raise ``ValueError`` if the | ||
| mask requests devices the host does not have or if the resolved | ||
| list would be empty. When ``False``, silently truncate to what | ||
| the host can satisfy - callers using this in | ||
| ``pytest.mark.parametrize`` get a clean "no tests collected for |
There was a problem hiding this comment.
strict=True default raises at collection time on hosts without cuda:0
With the default mask "110" and strict=True, calling cuda_test_devices() on a host that reports only ["cpu"] (no CUDA) immediately raises ValueError during pytest collection — before any test runs. Before this change, the same tests (parametrized with ["cuda:0", "cpu"]) would collect normally; only the cuda:0 variant would fail at runtime. Tests like test_episode_data.py and test_math.py have no AppLauncher dependency, so they are plausible targets for collection on CPU-only dev machines. Passing strict=False to the default call-sites, or documenting the constraint, would preserve the old graceful-degrade behaviour for contributors without full GPU setups.
| # Caller-supplied extra env vars (one KEY=value per line). | ||
| if [ -n "$extra_env_vars" ]; then | ||
| while IFS= read -r line; do | ||
| line="${line%%#*}" | ||
| line="$(echo "$line" | xargs)" | ||
| [ -z "$line" ] && continue | ||
| key="${line%%=*}" |
There was a problem hiding this comment.
Comment-stripping and
xargs trimming silently corrupt values containing # or single quotes
line="${line%%#*}" strips everything from the first # in a line, so a future caller passing a value like IMAGE_TAG=v1.0#nightly would silently become IMAGE_TAG=v1.0. Similarly, $(echo "$line" | xargs) collapses internal whitespace and fails on values containing unbalanced single quotes. For the two values used today (001 and cuda:1) this is harmless, but it is a latent trap for any future extra-env-vars user. Documenting the # restriction in the input description, or switching to a safer stripping approach, would prevent silent data loss.
There was a problem hiding this comment.
Update (437a6d2): 🔧 Build Job Pool Alignment
New commit reviewed (9d159f3 → 437a6d2):
ECR Cache Locality Fix
# Before:
runs-on: [self-hosted, gpu]
# After:
runs-on: [self-hosted, linux, x64, multi-gpu]Problem solved: The ECR cache repo is resolved per runner pool:
gpupool →gitci-docker-cachemulti-gpupool →multigpu-docker-cache
Building on [self-hosted, gpu] pushed the image to gitci-docker-cache, which the multi-GPU test job cannot see, causing a full rebuild (~27 min) on the scarce multi-GPU runner.
Fix: Align the build job to the same pool (multi-gpu) as the test job so both hit the same cache. This is the correct approach — build and test should share the same ECR namespace.
✅ LGTM for this commit. Clean infrastructure fix with excellent inline documentation explaining the cache topology.
📋 Status Summary:
- Previous reviews covered: runner label fix (
gpuremoval), script extraction, work-queue architecture - No new concerns in this commit
📜 Previous review history
Update (9d159f3): Script Extraction & Runner Targeting Fix
- Removed conflicting
gpulabel (was routing to 1-GPU boxes) - Extracted ~300 lines of inline shell/Python to
multi_gpu_host_launcher.shandaggregate_test_summary.py
Earlier: Container isolation, test_devices() API, Newton device-pinning, lazy pxr imports
| Ordered list of device strings as torch would address them. | ||
| """ | ||
| devices = ["cpu"] | ||
| if torch.cuda.is_available(): |
There was a problem hiding this comment.
Why are we using torch api here? This smells like tech debt. Shouldn't we switch to use warp? i.e.:
>>> import warp
>>> warp.init()
Warp 1.12.0 initialized:
CUDA Toolkit 12.9, Driver 13.0
Devices:
"cpu" : "x86_64"
"cuda:0" : "NVIDIA RTX PRO 6000 Blackwell Workstation Edition" (95 GiB, sm_120, mempool enabled)
Kernel cache:
/home/pbarejko/.cache/warp/1.12.0
>>> warp.get_cuda_devices()
['cuda:0']There was a problem hiding this comment.
Checked. using warp requires initializing runtime which might have side-effect to kit. torch does not have that.
| ], | ||
| ) | ||
| @pytest.mark.parametrize("device", ["cuda:0", "cpu"]) | ||
| @pytest.mark.parametrize("device", cuda_test_devices()) |
There was a problem hiding this comment.
Based on function name cuda_test_devices have we started to exclude cpu device tests. If so - why?
There was a problem hiding this comment.
renamed to test_devices
| so adopting this helper has zero impact on single-GPU runs. | ||
| """ | ||
|
|
||
| _ENV_VAR = "ISAACLAB_TEST_DEVICES" |
There was a problem hiding this comment.
Name of this global variable makes no sense, besides it's used in one place only. This unnecessary indirection makes code harder to read.
| reason="Multi-GPU tests disabled (set ISAACLAB_TEST_MULTI_GPU=1 to enable)", | ||
| ) | ||
| @pytest.mark.parametrize("device", ["cuda:1"]) | ||
| @pytest.mark.parametrize("device", cuda_test_devices(mask="001", strict=False)) |
There was a problem hiding this comment.
I don't understand the premise behind strictness in this function. I read it as maybe run on GPU 1, if unavailable then run on what's available. This introduces testing ambiguity, Just by looking at this line I don't know what you are trying to test.
In my opinion we would be better without cuda_test_devices function and keep list explicit, instead of resolved at runtime. In testing you want to keep constraints explicit.
There was a problem hiding this comment.
Strict is removed.
The goal of this is to dynamically configure the parameterization at test time so we run it where suitable. On regular test, cpu and cuda:0; and cuda:X for multi-gpu. The overall goal is to mark all tests needing device parameterization with this.
Per per-PR minimum-needed analysis: - isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var. - isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper. Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).
There was a problem hiding this comment.
Update (6a0e473): 🐳 Major Refactor — 1-Docker N-Shard Architecture + External Runner Script
(Incremental follow-up to previous review — 45+ commits since a1fa279)
Key Architectural Change: 1-Docker, N-Shard
Before: Each GPU shard ran in its own isolated Docker container
After: One container hosts all pytest shards as parallel subshells
┌─────────────────────────────────────────────────────────┐
│ Single Docker Container (--gpus all) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Shard cuda:1 │ │ Shard cuda:2 │ │ Shard cuda:3 │ │
│ │ (subshell) │ │ (subshell) │ │ (subshell) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ └────────────────┴────────────────┘ │
│ │ │
│ Shared PYTHONUSERBASE │
│ (junitparser, pytest, etc.) │
└─────────────────────────────────────────────────────────┘
Benefits:
- Eliminates cross-container races on shared workspace mount
- ~30s saved per shard (Docker init overhead removed)
- Shared pip install (single
junitparseretc. install vs N) - Simpler debugging (all shards in one container)
New: tools/multi_gpu_shard_runner.sh
The inside-container logic is now extracted to a standalone script:
Responsibilities:
- Sets up
HOME/PYTHONUSERBASEon tmpfs - Installs pytest deps into shared site-packages
- Auto-discovers GPU count via
nvidia-smi -L - Cross-checks against
torch.cuda.device_count() - Fans out 1 pytest subshell per non-default GPU
- Per-shard isolation: separate
HOME,ISAACLAB_TEST_DEVICES - Wait-all-then-aggregate exit codes
✅ Good extraction — versioned, shellcheck-able, reusable for local probing
Other Notable Changes
1. Kit GPU Pinning (ISAACLAB_PIN_KIT_GPU=1)
- Pins Kit renderer to single GPU when set
- Prevents multi-GPU race conditions in Kit's resource allocation
- Added to
app_launcher.pyand enabled in workflow
2. Directory-based Work Queue
- Uses atomic
os.rename()for claim contention - Reconciler step catches silent-drops
3. MIG Detection
- Handles both discrete GPUs and MIG slices
nvidia-smi -Lis authoritative (torch under-counts MIG)
4. DO-NOT-MERGE Gate
- Docs/install-ci workflows skip when PR title contains
DO-NOT-MERGE
Remaining Blockers
📋 Before merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml— still hardcoded
The PR title contains DO-NOT-MERGE suggesting this is intentional for iteration, but the actual production-readiness blocker is that line in build.yaml.
🏁 Verdict: Looking good for the CI scope. The 1-docker architecture is a solid improvement. Just need to revert the run_docker_tests override before merge.
There was a problem hiding this comment.
Update (e6689f0): 📊 Aggregated Test Summary & Filtered Live Stdout
(Incremental follow-up to previous review)
1. New Aggregated Test Summary Step
A new always()-conditional step parses JUnit XMLs and outputs:
- Per-file results table: file path, shard, status, pass/total, skipped, failed, walltime
- Per-shard aggregate: files, pass/total, skipped, failed, test time
- Grand total row
Dual output:
- Plain text to stdout (for workflow log readability)
- Markdown to
$GITHUB_STEP_SUMMARY(renders at top of run page)
✅ Excellent addition:
- Provides at-a-glance CI results without scrolling through logs
- Per-shard breakdown helps diagnose load imbalance
- Status emojis in markdown (✅ PASS, ❌ FAIL, ⚪ EMPTY) improve scannability
- Clean separation of stdout (tabular text) vs Summary (markdown tables)
2. Filtered Live Stdout
The pytest output pipeline now includes a grep -aE filter that retains only high-signal lines:
Retained:
- 🚀 emoji markers (test boundaries)
- Test result lines (
PASSED,FAILED,ERROR,SKIPPED,XFAIL,XPASS) - Summary stats (
Total:,Passing:,Failing:, etc.) - Error markers (
^E,Traceback,^FAILED) - File references (
^ +File) - Pytest section dividers (
^=+,^~~~~)
Filtered out:
- Kit init chatter, plugin registration
- omni.usd Transfer logs
- Verbose pytest collection output
✅ Good improvement:
- Reduces live log noise while preserving actionable info
- Full output still available in
$shard_log(printed in grouped sections at job end) stdbuf -oLensures line-buffered output (no partial line issues)-aflag on grep handles binary (e.g., emoji bytes) correctly
3. Timeout Raised to 60 Minutes
timeout-minutes: 60 # was 45The diagnostic comment was removed, and timeout raised from 45 → 60 minutes.
ℹ️ Context: The previous 45-min timeout was flagged as diagnostic; 60 min gives more headroom for the parallel shards. Once load-balancing data is collected, this can likely be tightened back down.
4. Debug Comments (Review Notes)
Several #rj: comments were added throughout the workflow:
#rj: need to test using gpu runner. see if ecr is connected.#rj: this is useless here. should be just before running tests#rj: what happens if test file only has a few test_devices()...#rj: is this needed for all docker? or what's the problem here?#rj: does test fail if either queue or inflight is not empty?#rj: this probably should be in action libs?
5. Remaining Blockers
📋 Before merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml— still pending - Remove
#rj:debug comments — review notes, not production docs
🏁 Verdict: Great additions for CI observability. The aggregated summary and filtered stdout significantly improve the developer experience for debugging multi-GPU test failures. Just need to clean up the debug comments and revert the docker test override.
Update (c514199): No changes to the aggregated test summary or filtered stdout logic in this increment. The new work-queue distribution in conftest.py complements these features well — JUnit XML paths now use full-path slugs to avoid collisions between shards. See primary review thread for full analysis.
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::
RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
trying to launch on device='cuda:0',
but input array for argument 'env_ids' is on device=cuda:2.
Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
Add an ISAACLAB_PIN_KIT_GPU env var to AppLauncher. When truthy, it appends Kit command-line overrides that pin the renderer to a single GPU (renderer.multiGpu.enabled=False, autoEnable=False, maxGpuCount=1) and disable the fabric GPU-interop path (physics.fabricUseGPUInterop= false), so each Kit process touches only its assigned GPU instead of enumerating every visible GPU at startup. Used by the multi-GPU CI workflow to avoid a shared GPU-interop context across concurrent sibling shards, which otherwise surfaces as "Stage X already attached" errors and SimulationApp.close hangs (see isaac-sim#3475). Off by default; single-GPU and user-facing rendering paths are unchanged.
Adds the scope-intersect-runtime device selection helper (test_devices) and its unit test, so unit tests can declare the devices they are valid on and the multi-GPU lane can narrow them via ISAACLAB_TEST_DEVICES.
Pins torch and Warp to the target device before allocations and scopes the CUDA graph capture to it, so Newton runs correctly on cuda:1+ (issue isaac-sim#5132).
…test tools/conftest.py gains a directory-rename work queue (claim/inflight/done) for work-stealing across shards and a per-file report slug to avoid JUnit collisions between same-basename files. A workspace-root conftest skips non-device-parametrized tests on non-default cuda shards, since single-GPU CI already covers them on cuda:0.
…ions Adds an extra-env-vars input so the multi-GPU workflow can inject ISAACLAB_TEST_DEVICES / ISAACLAB_SIM_DEVICE into the test container.
Adds a workflow that runs the unit-test suite across non-default cuda devices in one container with N parallel pytest shards pulling from a shared work queue, plus the inside-container shard runner it mounts.
Switches device-parametrized unit tests to test_devices() so they also run on the non-default GPUs in the multi-GPU lane. Mechanical scope change only; no test logic changes.
…cher When the caller does not pass an explicit device, AppLauncher reads ISAACLAB_SIM_DEVICE and uses it as the device. Lets the multi-GPU CI lane boot Kit on a non-default GPU without editing every test's AppLauncher() call site.
Speeds up iteration on the multi-GPU lane: forces run_docker_tests=false in build.yaml and gates docs.yaml / install-ci.yml behind a DO-NOT-MERGE PR-title check. Revert this commit before the PR is merged.
e6689f0 to
c514199
Compare
There was a problem hiding this comment.
Update (c514199): 🎯 Final Rebase — Clean Commit Structure
The PR has been force-pushed with a cleanly rebased commit history. This is a well-structured changeset with logical, atomic commits:
Commit Structure Analysis
-
3ca93df—Honor device kwarg over sim_cfg.device in build_simulation_context- Fixes a subtle bug where
build_simulation_context(sim_cfg=..., device="cuda:N")silently ignored the explicit device kwarg - Essential for multi-GPU test fixtures that pass both
- Fixes a subtle bug where
-
a396a07—Add ISAACLAB_PIN_KIT_GPU to pin Kit renderer to one GPU- New env var for
AppLauncherto disable Kit's multi-GPU enumeration - Fixes the shared GPU-interop context issue (issue #3475) that caused hangs in concurrent shards
- New env var for
-
f1c0d18—[Tests] Add test_devices helper for device-parametrized unit tests- The core of this PR: elegant
scope ∩ runtimehelper atisaaclab.test.utils.test_devices() - Well-documented mask grammar (
11X,110,00X, etc.) - 148 lines of unit tests in
test_device_selection.pycovering the helper itself
- The core of this PR: elegant
-
878990a—[Newton] Fix Newton/Warp init on non-default CUDA devices- Pins torch + Warp to the target device before allocations (issue #5132)
- Clean fix with good comments explaining the issue
-
b6cf4cf—[CI] Add multi-GPU work-queue sharding and per-file reporting to conftest- Directory-based work queue with atomic
os.rename()for claim contention - Detailed per-test timing table in the summary
- Directory-based work queue with atomic
-
6cef107—[CI] Forward extra env vars through run-tests / run-package-tests actions- Clean mechanism to inject
ISAACLAB_TEST_DEVICES/ISAACLAB_SIM_DEVICEinto containers
- Clean mechanism to inject
-
71658df—[CI] Add multi-GPU pytest workflow- 455-line workflow with excellent documentation
- Auto-discovery of opt-in tests via
grep -rl 'test_devices' - 1-docker N-shard architecture with parallel subshells
- Aggregated test summary with per-shard breakdown
-
cef4569—[Tests] Parametrize unit tests over multi-GPU device scope- Migrates 200+
parametrize("device", [...])sites totest_devices() - Consistent migration pattern across isaaclab, isaaclab_newton, isaaclab_physx, isaaclab_ovphysx
- Migrates 200+
-
5715dcb—[App] Honor ISAACLAB_SIM_DEVICE as implicit default device in AppLauncher- Allows shards to boot Kit on a non-default GPU without editing every test's
AppLauncher()call site
- Allows shards to boot Kit on a non-default GPU without editing every test's
-
c514199—TEMP: skip docker/docs/install-ci CI while iterating #5823- Temporary commit for iteration (DO-NOT-MERGE gate in title)
Remaining Blockers
📋 Before merge:
- Revert
run_docker_tests: 'false'override in.github/workflows/build.yaml— currently hardcoded to skip Docker tests during iteration - Remove commit
c514199or revert its changes — the CI skips for docs/install-ci/Docker tests
The temporary CI skips are expected during PR iteration, and the DO-NOT-MERGE in the title is an appropriate gate.
Summary
This is a well-architected PR that introduces clean, maintainable multi-GPU testing infrastructure:
✅ Elegant test_devices() helper with intuitive mask grammar
✅ Proper device-pinning fixes for Newton/Warp on non-default GPUs
✅ Clean 1-docker N-shard workflow architecture
✅ Atomic commit structure with each change self-contained
✅ Extensive migration of existing tests (200+ sites)
✅ Unit tests for the helper function itself
The PR is ready for merge once the temporary CI overrides are reverted. 🚀
Update (3694e7f): Branch merged with latest develop. PR-specific changes:
- ✅ Blocker resolved: Reverted the temporary CI skips (commit
4441f681) — docs/Docker/install-ci tests no longer bypassed. - ✅ Blocker resolved: Reverted the
run-testsaction extra env-var forwarding (commit1e710b15) — cleaner approach adopted. - ✅ Improved architecture: Replaced the repo-root
conftest.pydevice-skip hook with a dedicated pytest plugin (mgpu_shard_select.py) injected only in the multi-GPU lane via-p(commitb17679c6). This is a better design — the plugin only affects multi-GPU shards and is invisible to normal single-GPU CI. The deselection (not skip) approach also avoids polluting test reports. - 📝 Added documentation/annotations for the multi-GPU bash scripts (commit
49aadc57).
Previous inline comments status:
strict=Trueconcern ondevices.py: The new plugin design makes this moot — the shard-selection logic now lives in an isolated plugin rather than a root conftest, and tests are deselected (not skipped) when out of scope. The originalstrict=Truebehavior intest_devices()is unchanged but no longer causes collection-time failures on non-GPU hosts because the plugin only activates whenISAACLAB_TEST_DEVICESexcludes cpu+cuda:0.run-tests/action.ymlcomment-stripping concern: The action changes were reverted entirely, so this is no longer applicable.
No new issues in the PR-specific commits. The bulk of the diff is develop-branch merge content (cloner refactor, task renames, OvPhysX runtime guards, etc.) which is not part of this PR's scope.
The test job's runs-on carried the `gpu` label, which in the self-hosted fleet tags single-GPU runners. Requiring both `gpu` and `multi-gpu` routed the job onto a single-GPU box, so the shard runner aborted with "Need at least 2 visible devices; found 1"; the over-constrained label also left the job queued for hours. Drop `gpu` so the job targets the multi-GPU pool.
Move the host-orchestration bash (symlink, work-queue seed, MIG
detection, docker run, reconciler) and the JUnit-XML aggregation Python
out of the workflow's inline run: blocks into version-controlled,
lint-able scripts under .github/actions/multi-gpu/, alongside the
existing multi_gpu_shard_runner.sh. The workflow drops from 456 to 181
lines and each step becomes a one-line call.
Behavior-preserving: data still flows via step env, $GITHUB_OUTPUT,
$GITHUB_ENV, and $GITHUB_STEP_SUMMARY; the container --name now uses
the runner built-ins $GITHUB_RUN_ID/$GITHUB_RUN_ATTEMPT in place of the
YAML-only ${{ github.run_id }} templating. Also documents why the test
job must not carry the gpu label.
The ECR cache repo is resolved per runner pool (single-GPU runners -> gitci-docker-cache; multi-GPU runners -> multigpu-docker-cache). Building on [self-hosted, gpu] pushed the image to gitci-docker-cache, which the multi-GPU test job cannot see, so it rebuilt from scratch (~27 min) on the multi-GPU runner inside its pull step. Build on the multi-GPU pool so the image lands in multigpu-docker-cache and the test job's pull hits.
Move non-default-shard device selection out of a repo-root conftest.py into a scoped pytest plugin (mgpu_shard_select) that tools/conftest.py injects per file only on multi-GPU shards. The plugin keys off ISAACLAB_TEST_DEVICES so keep/drop matches each test's device parametrization, deselects out-of-scope variants (cpu, cuda:0, other-index), and maps the all-deselected NO_TESTS_COLLECTED exit to OK. Confines the behavior to the lane instead of a global root conftest.
Annotate the uncommon bash idioms in the two multi-GPU orchestration scripts for non-bash readers. Remove the now-dead py-spy/gdb hang-capture remnants -- the SYS_PTRACE cap-add and the py-spy pip install -- along with the matching changelog bullet, since the conftest-side capture they supported was removed.
…ests actions" This reverts commit 6cef107.
…n_context (#5881) ## Summary - `build_simulation_context(device="cuda:N", sim_cfg=...)` silently dropped the explicit `device` kwarg when a `sim_cfg` was passed, falling back to `sim_cfg.device` (default `cuda:0`). - The multi-GPU CI lane sets `ISAACLAB_SIM_DEVICE=cuda:N` per shard, so tests that pass `device="cuda:N"` got `cuda:0` instead. Downstream Warp kernels then ran on `cuda:0` while the rest of the test believed it was on `cuda:N`: ``` RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index', trying to launch on device='cuda:0', but input array for argument 'env_ids' is on device=cuda:2. ``` - Fix: make `device`'s default `None` (sentinel) and apply it as an override after `sim_cfg` is resolved, so an explicit kwarg wins whether or not a `sim_cfg` was supplied. ## 1. The bug ```python def build_simulation_context(..., device: str = "cuda:0", sim_cfg=None, ...): if sim_cfg is None: sim_cfg = SimulationCfg(device=device, ...) # ^^ explicit `device` only used in the no-sim_cfg path; otherwise ignored ``` When a caller passed both `sim_cfg=<built with default device>` and `device="cuda:2"`, the kwarg was thrown away. Code that pulled the active device from `sim_cfg.device` saw `cuda:0`; Warp arrays allocated against the cfg device landed on `cuda:0` while torch ops driven by the kwarg ran on `cuda:2` — the cross-device kernel-launch error above. ## 2. Fix ```python def build_simulation_context(..., device: str | None = None, sim_cfg=None, ...): if sim_cfg is None: gravity = (0.0, 0.0, -9.81) if gravity_enabled else (0.0, 0.0, 0.0) sim_cfg = SimulationCfg(dt=dt, gravity=gravity) if device is not None: sim_cfg.device = device # explicit kwarg wins in both branches ``` `device=None` (default) means "use whatever the cfg already has". `device="cuda:N"` is honored even when a cfg is also passed. ## 3. Validation `source/isaaclab/test/sim/test_build_simulation_context_{headless,nonheadless}.py::test_build_simulation_context_cfg` is updated to assert the new override semantics (explicit `device` wins over `sim_cfg.device`). On local multi-GPU/MIG hardware, `build_simulation_context(sim_cfg=cfg, device="cuda:N")` previously hit the kernel-launch assertion; with the fix it runs on the requested device. Consumed by the multi-GPU CI lane (#5823).
# Conflicts: # source/isaaclab_ovphysx/test/assets/test_articulation.py # source/isaaclab_ovphysx/test/assets/test_rigid_object.py # source/isaaclab_ovphysx/test/assets/test_rigid_object_collection.py # tools/conftest.py
1. Summary
isaaclab.test.utils.test_devices()— resolves a test's parametrized device list asscope ∩ runtime: the call-site mask declares the devices a test is valid on (default"11X"); theISAACLAB_TEST_DEVICESenv var declares the devices a run may use (default"110"⇒ cpu + cuda:0).ISAACLAB_SIM_DEVICE, honored byisaaclab.app.AppLauncheras the implicit-default boot device, so the lane can boot Kit on a non-default GPU without editing everyAppLauncher()call site.cuda:N.test-multi-gpu-pytest.yamlruns the device-parametrized unit-test suite across the non-default GPUs and migrates ~20 test files onto the helper. Single-GPU CI behavior is unchanged.Chained on top of #5881 (device-kwarg fix) and #5933 (
ISAACLAB_PIN_KIT_GPU); both are required for the lane to pass and will drop out of this branch's diff once they land.2. Device selection:
scope ∩ runtimeA test declares scope; the operator/CI sets the runtime; the executed set is the intersection. The author never names the shard's GPU; the operator never inspects a test's device support.
2.1 Mask grammar
Positions left to right:
0= cpu,1= cuda:0 (default GPU), then each remaining position is a non-default GPU. Each is0/1; a trailingXmeans "include all remaining devices". cpu, cuda:0, and a non-default GPU stay distinct by position — passing on cuda:0 says nothing about cuda:1+, which is the point.11011X00X100test_devices()defaults to"11X", so the device-agnostic common case needs no argument. An empty result means the test is cleanly skipped for that run (e.g. a"00X"test on a single-GPU host). The helper raises only when the run explicitly setISAACLAB_TEST_DEVICESto a device the host lacks (a misconfigured shard), so a runner with fewer GPUs than expected fails loudly instead of reporting a vacuous green.3. Workflow shape (1-docker, N shards)
config→build(ECR cache, on[self-hosted, linux, x64, multi-gpu]so the per-pool image cache it populates is the onetestpulls) →test(on[self-hosted, linux, x64, multi-gpu]). The test job launches one container that hosts N parallel pytest subshells, one per non-defaultcuda:N. Each subshell pinsISAACLAB_SIM_DEVICE/ISAACLAB_TEST_DEVICESto its own GPU and pulls test files from a shared directory-based work queue (claim via atomicos.renamefromqueue/→inflight/<shard>/→done/<shard>/; no flock). A job-end reconciler asserts the queue and in-flight sets are empty, so a silently dropped file fails the job. This single-container shape removes the cross-container races on the shared workspace mount and the per-shard docker-init overhead of a container-per-GPU layout. The inside-container runner lives at.github/actions/multi-gpu/multi_gpu_shard_runner.sh.Inside each shard,
tools/conftest.pyruns./isaaclab.sh -p -m pytestone file at a time (subprocess-per-file), with a per-file JUnit report slug to avoid collisions between same-basename files across packages. A scoped pytest plugin (.github/actions/multi-gpu/mgpu_shard_select.py, injected per file bytools/conftest.pyonly on the non-default shards) deselects everything but the in-scopedevicevariant at collection time — non-device-parametrized tests and the cpu/cuda:0 variants both drop, since single-GPU CI already covers those on cuda:0.4. Single-GPU CI invariant
Migrated tests behave identically on the single-GPU pool: with
ISAACLAB_TEST_DEVICESunset the runtime is"110", so an argless /"11X"scope resolves to[cpu, cuda:0]— the same set tests carried before.ISAACLAB_SIM_DEVICEunset ⇒AppLauncherdefaults tocuda:0. Both env vars take effect only on the multi-GPU workflow.5. Notes
test_articulationcases pinned to cuda:0 (device-mismatch onwrite_root_*_pose_to_simand Newton init oncuda:N) are addressed by the chained [MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context #5881 + the Newton init fix here; those tests now run and pass on the non-default GPUs.devicevalue, so even literalparametrize("device", [...])sites have their cpu/cuda:0 variants dropped on shards; sites that don't yet use the helper simply gain nocuda:Nvariant, and migrating them totest_devices()for non-default-GPU coverage is a separate pass.