Skip to content

fix(agent-server): run bash cleanup traps on timeout via process-group SIGTERM#3175

Merged
VascoSch92 merged 1 commit into
mainfrom
3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm
May 11, 2026
Merged

fix(agent-server): run bash cleanup traps on timeout via process-group SIGTERM#3175
VascoSch92 merged 1 commit into
mainfrom
3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm

Conversation

@VascoSch92
Copy link
Copy Markdown
Contributor

@VascoSch92 VascoSch92 commented May 9, 2026

  • A human has tested these changes.

Summary

When a bash command exceeds its timeout, the agent-server sent SIGKILL before SIGTERM, so user-installed traps (trap 'cleanup' EXIT, etc.) never ran — leaving stale PID files, temp dirs, and connections behind.

Just swapping the order isn't enough: asyncio.create_subprocess_shell runs the command under a shell, and process.terminate() only signals the shell. While the shell is blocked waiting on a foreground child (e.g. sleep), it cannot
dispatch its trap until the child exits. Signaling only the shell leaves the trap pending forever, then SIGKILL on the shell still bypasses it.

Fix: spawn the command in its own session (start_new_session=True) and signal the whole process group via os.killpg — SIGTERM first, escalating to SIGKILL after a 1s grace period. The shell's children die, the shell unblocks, and
the trap runs. This matches the pattern already used in openhands-sdk/.../hooks/executor.py and openhands-tools/.../subprocess_terminal.py.

Issue Number

Closes #3120.

How to Test

  • tests/agent_server/test_bash_service.py::test_bash_timeout_runs_sigterm_trap now passes (xfail removed)
    • tests/agent_server/stress/test_long_running_command.py::test_bash_timeout_kills_process_cleanly still passes — no zombie descendants after timeout-kill
    • tests/agent_server/stress/test_long_running_command.py::test_long_running_bash_does_not_block_event_loop still passes
    • pre-commit clean (ruff, pyright, dep rules)

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:4679f60-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-4679f60-python \
  ghcr.io/openhands/agent-server:4679f60-python

All tags pushed for this build

ghcr.io/openhands/agent-server:4679f60-golang-amd64
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-golang-amd64
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-golang-amd64
ghcr.io/openhands/agent-server:4679f60-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:4679f60-golang-arm64
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-golang-arm64
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-golang-arm64
ghcr.io/openhands/agent-server:4679f60-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:4679f60-java-amd64
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-java-amd64
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-java-amd64
ghcr.io/openhands/agent-server:4679f60-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:4679f60-java-arm64
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-java-arm64
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-java-arm64
ghcr.io/openhands/agent-server:4679f60-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:4679f60-python-amd64
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-python-amd64
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-python-amd64
ghcr.io/openhands/agent-server:4679f60-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:4679f60-python-arm64
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-python-arm64
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-python-arm64
ghcr.io/openhands/agent-server:4679f60-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:4679f60-golang
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-golang
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-golang
ghcr.io/openhands/agent-server:4679f60-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:4679f60-java
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-java
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-java
ghcr.io/openhands/agent-server:4679f60-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:4679f60-python
ghcr.io/openhands/agent-server:4679f600027de58844b62dcdc1d64bd3b69eac41-python
ghcr.io/openhands/agent-server:3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm-python
ghcr.io/openhands/agent-server:4679f60-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 4679f60-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 4679f60-python-amd64) are also available if needed

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-agent-server/openhands/agent_server
   bash_service.py1973084%71–73, 142–143, 145–146, 176–178, 191–194, 274, 279–280, 301–306, 333–334, 336, 343–344, 374–375
TOTAL262371151356% 

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! ✅

Clean fix for the trap execution bug. The process-group signaling approach is correct and follows established patterns in the codebase.

Risk: 🟢 LOW - Bug fix with existing test coverage, no agent behavior changes.

@VascoSch92 VascoSch92 marked this pull request as ready for review May 9, 2026 13:52
@VascoSch92 VascoSch92 requested a review from csmith49 May 9, 2026 13:52
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

Clean fix for the cleanup trap bug. The process-group signaling approach is correct and follows the established pattern already used in hooks/executor.py and subprocess_terminal.py.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

Bug fix with existing test coverage. The fix makes bash timeout handling behave correctly (allowing cleanup traps to run), and stress tests verify no zombie processes or event loop blocking. No agent decision-making logic affected.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the fix works end-to-end by running actual bash commands with TERM traps that execute on timeout.

Does this PR achieve its stated goal?

Yes. The PR fixes the bug where bash cleanup traps didn't run on timeout. Previously, the agent-server sent SIGKILL before SIGTERM, preventing user-installed traps from executing. The fix spawns commands in a new session and signals the entire process group with SIGTERM first (1s grace period), then escalates to SIGKILL only if needed. Real-world testing confirms that cleanup traps now execute successfully when commands timeout.

Phase Result
Environment Setup uv sync --dev completed successfully
CI Status ✅ All critical checks passing (agent-server-tests, sdk-tests, pre-commit, coverage-report)
Functional Verification ✅ TERM traps execute on timeout; no regressions in stress tests
Functional Verification

Test 1: Verify the bug existed on main

Step 1 — Establish baseline (main branch without fix):

Checked out main branch and inspected the test:

$ git checkout main
$ head -60 tests/agent_server/test_bash_service.py | tail -20

The test was marked @pytest.mark.xfail with reason:

@pytest.mark.xfail(
    strict=True,
    reason=(
        "bash_service.py:276 sends SIGKILL before SIGTERM, so cleanup "
        "traps never run. Fix: terminate() → wait → kill() (see "
        "vscode_service.py:78-84). "
        "Tracked in https://github.com/OpenHands/software-agent-sdk/issues/3120."
    ),
)

This confirms the bug was known and the test was expected to fail.

Step 2 — Verify the bug in code:

Inspected bash_service.py on main (lines 274-282):

except TimeoutError:
    # Kill the process if it times out
    process.kill()  # Line 276: SIGKILL sent FIRST
    try:
        # Give the process a short time to die gracefully
        await asyncio.wait_for(process.wait(), timeout=1.0)
    except TimeoutError:
        # If it still won't die, terminate it more forcefully
        process.terminate()  # SIGTERM sent as fallback (too late)

This shows the bug: process.kill() (SIGKILL) is sent immediately, making traps uncatchable. process.terminate() (SIGTERM) only runs if SIGKILL somehow doesn't work, which is backwards.

Ran the test on main:

$ uv run pytest tests/agent_server/test_bash_service.py::test_bash_timeout_runs_sigterm_trap -v
...
test_bash_service.py::test_bash_timeout_runs_sigterm_trap XFAIL (bash_service.py:276 sends SIGKILL before SIGTERM...)
1 xfailed, 5 warnings in 2.54s

The test confirms expected failure (XFAIL).


Test 2: Verify the fix works on PR branch

Step 3 — Apply the PR's changes:

Checked out PR branch:

$ git checkout 3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm

Inspected the fixed code in bash_service.py (lines 212-226, 295-309):

Key change 1 — New session for process group signaling:

process = await asyncio.create_subprocess_shell(
    command.command,
    ...
    start_new_session=True,  # NEW: spawn in own session
)

Key change 2 — Helper method for process group signals:

def _signal_process_group(
    self,
    process: asyncio.subprocess.Process,
    sig: signal.Signals,
    command: str,
) -> None:
    try:
        os.killpg(os.getpgid(process.pid), sig)  # Signal whole group
    except ProcessLookupError:
        pass

Key change 3 — SIGTERM before SIGKILL with grace period:

except TimeoutError:
    # Send SIGTERM to the whole process group so user-installed
    # cleanup traps can run, then escalate to SIGKILL if needed.
    self._signal_process_group(process, signal.SIGTERM, command.command)
    try:
        await asyncio.wait_for(process.wait(), timeout=1.0)
    except TimeoutError:
        self._signal_process_group(process, signal.SIGKILL, command.command)

This is exactly the fix described in the PR: SIGTERM first, 1s grace period, then SIGKILL.

Step 4 — Verify with unit test:

Ran the previously-failing test:

$ uv run pytest tests/agent_server/test_bash_service.py::test_bash_timeout_runs_sigterm_trap -v
...
tests/agent_server/test_bash_service.py::test_bash_timeout_runs_sigterm_trap PASSED [100%]
1 passed, 5 warnings in 1.37s

✅ The test now PASSES (xfail decorator removed, trap executes successfully).

Step 5 — Verify with functional demonstration:

Created and ran a standalone script that:

  1. Creates a bash command with trap 'touch /tmp/marker; exit 0' TERM
  2. Sleeps for 30 seconds with a 1-second timeout
  3. Verifies the marker file is created (proving the trap ran)
$ uv run python /tmp/test_trap_on_timeout.py
Marker file path: /tmp/tmp6c1n35ef/cleanup_ran

Executing command: trap 'touch /tmp/tmp6c1n35ef/cleanup_ran; exit 0' TERM; sleep 30
Timeout: 1s

Waiting for command to timeout and TERM trap to run...

✅ SUCCESS: TERM trap executed successfully!
   The marker file cleanup_ran was created by the trap handler.
   This proves that SIGTERM was sent to the process group,
   allowing the shell's trap to run before the process was killed.

✅ Real-world execution confirms the fix works.


Test 3: Verify no regressions in stress tests

Ran the stress tests mentioned in the PR description:

Test 3a — Clean process kill:

$ uv run pytest tests/agent_server/stress/test_long_running_command.py::test_bash_timeout_kills_process_cleanly -v -m stress
...
test_bash_timeout_kills_process_cleanly PASSED [100%]
1 passed, 5 warnings in 1.33s

Test 3b — Event loop not blocked:

$ uv run pytest tests/agent_server/stress/test_long_running_command.py::test_long_running_bash_does_not_block_event_loop -v -m stress
...
test_long_running_bash_does_not_block_event_loop PASSED [100%]
1 passed, 5 warnings in 5.32s

✅ Both stress tests pass. No regressions in process cleanup or async behavior.

Issues Found

None.

Copy link
Copy Markdown
Collaborator

@csmith49 csmith49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I've ever seen the review bot do a "LGTM" before!

@VascoSch92 VascoSch92 merged commit a132837 into main May 11, 2026
51 of 52 checks passed
@VascoSch92 VascoSch92 deleted the 3120-agent-server-bash-cleanup-traps-dont-run-because-sigkill-precedes-sigterm branch May 11, 2026 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

agent-server: bash cleanup traps don't run because SIGKILL precedes SIGTERM

3 participants