Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
Fixed
^^^^^

* Handled ``SIGHUP`` in :class:`~isaaclab.app.AppLauncher` so the
simulation app shuts down cleanly when the controlling session leader
exits (e.g. parent shell supervising sibling shards in multi-GPU CI).
Previously SIGHUP terminated the process with default disposition,
bypassing :meth:`SimulationApp.close` and leaving USD/PhysX state
attached for the next sibling shard ("Stage X already attached" log
line and downstream shutdown hangs).
* Made :meth:`~isaaclab.app.AppLauncher._abort_signal_handle_callback`
exit the process after closing the app. The previous implementation
swallowed the signal's terminate semantics, allowing Python to
resume past a SIGTERM/SIGABRT/SIGSEGV and leaving Kit half-torn-down.

Added
^^^^^

* Added ``ISAACLAB_FORCE_EXIT_TIMEOUT`` env var (integer seconds) for
:class:`~isaaclab.app.AppLauncher`. When set, arms a daemon thread that
fires SIGKILL on the current process via a raw libc ``kill(2)`` syscall
through ``ctypes`` after the deadline, from both the ``atexit`` hook
and the abort signal handler. Used by CI to bound the upstream Kit
shutdown hang inside ``quickReleaseFrameworkAndTerminate`` (see
https://github.com/isaac-sim/IsaacLab/issues/3475, NVBug 5948099 /
OMPE-75416). The hang is a GIL deadlock — Python's bytecode dispatcher
cannot run while Kit's teardown C++ frames hold the GIL — so
``os._exit`` (which needs the dispatcher) cannot reliably fire from a
daemon thread; the ctypes-issued libc syscall releases the GIL across
the C call and is the only python-side approach that fires under the
worst case. Off by default — interactive user code still gets the
graceful teardown.
113 changes: 106 additions & 7 deletions source/isaaclab/isaaclab/app/app_launcher.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,69 @@
logging.getLogger("h5py").setLevel(logging.WARNING)


def _parse_force_exit_seconds(value: str) -> int:
"""Parse the ``ISAACLAB_FORCE_EXIT_TIMEOUT`` env var.

Returns 0 (no timer) for unset/empty/falsy values, otherwise a positive
integer of seconds. Invalid integers also return 0 with a warning so a
typo cannot accidentally arm a force-exit.
"""
raw = (value or "").strip()
if not raw or raw.lower() in {"0", "false", "no", "off"}:
return 0
try:
n = int(raw)
except ValueError:
logger.warning("ISAACLAB_FORCE_EXIT_TIMEOUT=%r is not an integer; ignoring", raw)
return 0
return max(0, n)


def _arm_force_exit_timer(seconds: int, label: str) -> None:
"""Spawn a daemon thread that calls ``os._exit(0)`` after ``seconds``.

Bounds the Kit ``quickReleaseFrameworkAndTerminate`` hang in CI
environments (https://github.com/isaac-sim/IsaacLab/issues/3475 / NVBug
5948099 / OMPE-75416).

Uses ``ctypes.CDLL("libc.so.6").kill(os.getpid(), SIGKILL)`` rather than
``os._exit(0)`` because the hang in question is a GIL deadlock: Kit's
teardown C++ frames hold the GIL while joining ``carb.tasking`` worker
threads that themselves wait for the GIL. ``os._exit`` is a Python-level
syscall wrapper that requires the bytecode dispatcher (which needs the
GIL); if the daemon thread cannot acquire the GIL, the wake never runs.
``ctypes`` releases the GIL around C calls by default, so the raw libc
``kill(2)`` syscall fires even when the interpreter is GIL-blocked.

The daemon thread is created (and parks on ``time.sleep``, which releases
the GIL) BEFORE the close() call so it is already in the GIL-released
sleep state when the deadlock starts. SIGKILL ends the process; no
atexit, no JUnit cleanup runs — callers must have already written their
JUnit XML before triggering close().
"""
import ctypes
import threading

def _timeout_kill() -> None:
import time as _time

_time.sleep(seconds)
sys.stderr.write(
f"[isaaclab.app] ISAACLAB_FORCE_EXIT_TIMEOUT={seconds}s expired during {label}; libc.kill(self, SIGKILL)\n"
)
sys.stderr.flush()
try:
libc = ctypes.CDLL("libc.so.6", use_errno=True)
# SIGKILL is 9 on Linux. Raw syscall — does not require the GIL
# to be dispatchable through the interpreter.
libc.kill(os.getpid(), 9)
except Exception:
# Last-resort fallback: only works when GIL is reachable.
os._exit(0)

threading.Thread(target=_timeout_kill, name="isaaclab-force-exit", daemon=True).start()


class ExplicitAction(argparse.Action):
"""Custom action to track if an argument was explicitly passed by the user."""

Expand Down Expand Up @@ -344,19 +407,43 @@ def __init__(self, launcher_args: argparse.Namespace | dict | None = None, **kwa
# Ensure SimulationApp.close() is called on normal process exit so Kit
# shuts down cleanly instead of relying on __del__ (which logs a warning
# and can leave GPU resources in a bad state for the next test).
def _atexit_close(app=self._app):
#
# ``ISAACLAB_FORCE_EXIT_TIMEOUT`` (env var, integer seconds) arms a
# daemon thread that calls ``os._exit(0)`` after the deadline. Used by
# CI to bound the Kit shutdown hang
# (https://github.com/isaac-sim/IsaacLab/issues/3475). The hang sits
# inside ``shutdown_and_release_framework`` (Kit binary path), which
# ``skip_cleanup=True`` also enters and Python signal handlers cannot
# interrupt — only ``os._exit`` actually returns control. Unset/empty/0
# means "wait indefinitely", the current upstream behavior and the
# right default for interactive user code that wants graceful teardown.
force_exit_seconds = _parse_force_exit_seconds(os.environ.get("ISAACLAB_FORCE_EXIT_TIMEOUT", ""))
self._force_exit_seconds = force_exit_seconds

def _atexit_close(app=self._app, force_exit_seconds=force_exit_seconds):
if force_exit_seconds > 0:
_arm_force_exit_timer(force_exit_seconds, "atexit_close")
with contextlib.suppress(Exception):
app.close()

atexit.register(_atexit_close)

# Set up signal handlers for graceful shutdown
# -- during explicit `kill` commands
# Set up signal handlers for graceful shutdown. Kit launches with
# ``--/app/installSignalHandlers=0``, so it's on us to drive ``app.close()``
# before the default disposition terminates the process and leaves
# USD / PhysX state attached (the next sibling shard then trips
# "Stage X already attached" or hangs on physx detach).
# -- during explicit ``kill`` commands
signal.signal(signal.SIGTERM, self._abort_signal_handle_callback)
# -- during aborts
signal.signal(signal.SIGABRT, self._abort_signal_handle_callback)
# -- during segfaults
signal.signal(signal.SIGSEGV, self._abort_signal_handle_callback)
# -- when the controlling session leader (e.g. a parent shell that
# supervises sibling shards) exits: SIGHUP cascades to children, and
# without a handler the default action would terminate before
# ``_atexit_close`` could run.
signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Suggestion: Platform guard for SIGHUP

signal.SIGHUP is not available on Windows. If Isaac Lab ever supports Windows (or if someone imports this module on Windows for testing), this line would raise AttributeError.

A minimal guard:

if hasattr(signal, 'SIGHUP'):
    signal.signal(signal.SIGHUP, self._abort_signal_handle_callback)

Low priority since Isaac Lab targets Linux, but good defensive practice.


"""
Properties.
Expand Down Expand Up @@ -1374,7 +1461,19 @@ def _hide_play_button(self, flag):
play_button_group._play_button.visible = not flag # type: ignore
play_button_group._play_button.enabled = not flag # type: ignore

def _abort_signal_handle_callback(self, signal, frame):
"""Handle the abort/segmentation/kill signals."""
# close the app
self._app.close()
def _abort_signal_handle_callback(self, signum, frame):
"""Handle the abort/segmentation/kill/hangup signals.

Closes :class:`SimulationApp` (honoring ``ISAACLAB_FORCE_EXIT_TIMEOUT``
if set) so Kit detaches USD/PhysX state, then exits with
``128 + signum`` to preserve the conventional signal-exit encoding.
Without the explicit exit, Python would resume execution after the
handler returns (since we replaced the OS-default disposition), and
Kit would be left half-torn-down.
"""
force_exit_seconds = getattr(self, "_force_exit_seconds", 0)
if force_exit_seconds > 0:
_arm_force_exit_timer(force_exit_seconds, f"abort_signal_{signum}")
with contextlib.suppress(Exception):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Suggestion: Double-close with _atexit_close

sys.exit() raises SystemExit → Python runs atexit handlers → _atexit_close calls app.close() again. Both call sites are guarded by contextlib.suppress(Exception), so this is safe but redundant.

If SimulationApp.close() is not idempotent and logs warnings on double-close (some Kit versions do), you could either:

  1. Call atexit.unregister(_atexit_close) before sys.exit() (requires making _atexit_close accessible, e.g. storing it as self._atexit_close), or
  2. Add a guard flag (e.g. self._app_closed = True) checked by _atexit_close.

Not blocking — the suppress handles it — but worth considering if you see noisy double-close warnings in CI logs.

self._app.close()
sys.exit(128 + signum)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Warning: sys.exit() may be unsafe after SIGSEGV

sys.exit() raises SystemExit, which unwinds the stack through finally blocks and triggers atexit handlers. After a segmentation fault, the process heap/stack may be corrupted, making this unwinding dangerous (it could deadlock in a finalizer or corrupt data further).

For SIGSEGV specifically, os._exit(128 + signum) would be safer — it terminates immediately without running cleanup hooks. You could branch:

with contextlib.suppress(Exception):
    self._app.close()
if signum == signal.SIGSEGV:
    os._exit(128 + signum)
sys.exit(128 + signum)

Alternatively, if you want to keep it simple and accept the risk (SIGSEGV in Python-land is already UB), the current approach is defensible — just noting the trade-off.